Abstract

We show that if performance measures in stochastic and dynamic scheduling problems
satisfy generalized conservation laws, then the feasible space of achievable performance is a
polyhedron called an extended polymatroid that generalizes the usual polymatroids
introduced by Edmonds. Optimization of a linear objective over an extended polymatroid is
solved by an adaptive greedy algorithm, which leads to an optimal solution having an
indexability property (indexable systems). Under a certain condition, the indices have
a stronger decomposition property (decomposable systems). The following classical
problems can be analyzed using our theory: multi-armed bandit problems, branching bandits,
multiclass queues, multiclass queues with feedback, deterministic scheduling problems.
Interesting consequences of our results include: (1) a characterization of indexable systems as
systems that satisfy generalized conservation laws, (2) a sufficient condition for indexable
systems to be decomposable, (3) a new linear programming proof of the decomposability
property of Gittins indices in multi-armed bandit problems, (4) a unified and practical
approach to sensitivity analysis of indexable systems, (5) a new characterization of the indices
of indexable systems as sums of dual variables and a new interpretation of the indices in
terms of retirement options in the context of branching bandits, (6) the first rigorous
analysis of the indexability of undiscounted branching bandits, (7) a new algorithm to compute
the indices of indexable systems (in particular Gittins indices), which is as fast as the fastest
known algorithm, (8) a unification of the algorithm of Klimov for multiclass queues and
the algorithm of Gittins for multi-armed bandits as special cases of the same algorithm, (9)
closed form formulae for the performance of the optimal policy, and (10) an understanding
of the nondependence of the indices on some of the parameters of the stochastic scheduling
problem. Most importantly, our approach provides a unified treatment of several classical
problems in stochastic and dynamic scheduling and is able to address in a unified way their
variations such as: discounted versus undiscounted cost criterion, rewards versus taxes,
preemption versus nonpreemption, discrete versus continuous time, work conserving versus
idling policies, linear versus nonlinear objective functions.


1      Introduction 

In the mathematical programming tradition researchers and practitioners solve
optimization problems by defining decision variables and formulating constraints, thus describing the
feasible space of decisions, and applying algorithms for the solution of the underlying
optimization problem. For the most part, the tradition for stochastic and dynamic scheduling
problems has been, however, quite different, as it relies primarily on dynamic programming
formulations. Using ingenious but often ad hoc methods, which exploit the structure of the
particular problem, researchers and practitioners can sometimes derive insightful structural
results that lead to efficient algorithms. In their comprehensive survey of deterministic
scheduling problems Lawler et al. [23] end their paper with the following remarks: "The
results in stochastic scheduling are scattered and they have been obtained through a
considerable and sometimes disheartening effort. In the words of Coffman, Hofri and Weiss [8],
there is great need for new mathematical techniques useful for simplifying the derivation of
the results".

Perhaps one of the most important successes in the area of stochastic scheduling in the
last twenty years is the solution of the celebrated multi-armed bandit problem, a generic
version of which in discrete time can be described as follows:

The multi-armed bandit problem: There are $K$ parallel projects, indexed $k = 1, \ldots, K$.
Project $k$ can be in one of a finite number of states $i_k$. At each instant of discrete
time $t = 0, 1, \ldots$ one can work on only a single project. If one works on project $k$ in state
$i_k(t)$ at time $t$, then one receives an immediate expected reward of $R_{i_k(t)}$. Rewards are
additive and discounted in time by a factor $0 < \beta < 1$. The state $i_k(t)$ changes to $i_k(t+1)$
by a Markov transition rule (which may depend on $k$, but not on $t$), while the states of the
projects one has not engaged remain unchanged, i.e., $i_l(t+1) = i_l(t)$ for $l \neq k$. The problem
is how to allocate one's resources sequentially in time in order to maximize expected total
discounted reward over an infinite horizon.

The problem has numerous applications and a rather vast literature (see Gittins [16]
and the references therein). It was originally solved by Gittins and Jones [14], who proved
that to each project $k$ one could attach an index $\gamma^k(i_k(t))$, which is a function of the project
$k$ and the current state $i_k(t)$ alone, such that the optimal action at time $t$ is to engage the
project of largest current index. They also proved the important result that these index
functions satisfy a stronger index decomposition property: the function $\gamma^k(\cdot)$ only depends
on characteristics of project $k$ (states, rewards and transition probabilities), and not on any
other project. These indices are now known as Gittins indices, in recognition of Gittins'
contribution. Since the original solution, which relied on an interchange argument, other proofs
were proposed: Whittle [36] provided a proof based on dynamic programming, subsequently
simplified by Tsitsiklis [30]. Varaiya, Walrand and Buyukkoc [33] and Weiss [35] provided
different proofs based on interchange arguments. Weber [34] outlined an intuitive proof.
More recently, Tsitsiklis [31] has provided a proof based on a simple inductive argument.

The multi-armed bandit problem is a special case of a dynamic and stochastic job
scheduling system $S$. In this context, there is a set $E$ of job types and we are interested
in optimizing a function of a performance measure (rewards or taxes) under a class of
admissible scheduling policies.

Definition 1 (Indexable Systems) We say that a dynamic and stochastic job scheduling
system $S$ is indexable if the following policy is optimal: To each job type $i$ we attach an
index, $\gamma_i$. At each decision epoch select a job with the largest index.

In general the indices $\gamma_i$ could depend on the entire set $E$ of job types. Consider a partition
of the set $E$ into subsets $E_k$, $k = 1, \ldots, K$, which contain collections of job types and can be
interpreted as projects consisting of several job types. In certain situations, the index of
job type $i \in E_k$ depends only on the characteristics of the job types in $E_k$ and not on the
entire set $E$ of job types. Such a property is particularly useful computationally since it
enables the system to be decomposed into smaller parts and the computation of the indices
can be done independently. As we have seen the multi-armed bandit problem has this
decomposition property, which motivates the following definition:

Definition 2 (Decomposable Systems) An indexable system is called decomposable if
for all job types $i \in E_k$, the index $\gamma_i$ of job type $i$ depends only on the characteristics of the
set of job types $E_k$.

In  addition  to  the  multi-armed  bandit  problem,  a  variety  of  dynamic  and  stochastic 
scheduling  problems  has  been  solved  in  the  last  decades  by  indexing  rules: 

1.  Extensions of the usual multi-armed bandit problem such as arm-acquiring bandits
(Whittle [37], [38]) and more generally branching bandits (Weiss [35]), that include
several important problems as special cases.

2.  The multiclass queueing scheduling problem with Bernoulli feedback (Klimov [22],
Tcha and Pliska [29]).

3.  The multiclass queueing scheduling problem without feedback (Cox and Smith [9],
Harrison [19], Kleinrock [21], Gelenbe and Mitrani [13], Shantikumar and Yao [26]).

4.  Deterministic  scheduling  problems  (Smith  [27]). 

An interesting distinction, which is not emphasized in the literature, is that examples
(1) and (2) above are indexable systems, but they are not in general decomposable systems.
Example (3), however, has a more refined structure. It is indexable, but not decomposable,
under discounting, while it is decomposable under the average cost criterion (the $c\mu$ rule).
As already observed, the multi-armed bandit problem is an example of a decomposable
system, while example (4) above is also decomposable.

Faced with these results, one asks what is the underlying deep reason that these
nontrivial problems have very efficient solutions both theoretically as well as practically. In
particular, what is the class of stochastic and dynamic scheduling problems that are
indexable? Under what conditions are indexable systems decomposable? But most importantly,
is there a unified way to address stochastic and dynamic scheduling problems that will lead
to a deeper understanding of their strong structural properties? This is the set of questions
that motivates this work.

In  the  last  decade  the  following  approach  has  been  proposed  to  address  special  cases 
of  these  questions.  In  broad  terms,  researchers  try  to  describe  the  feasible  space  of  a 
stochastic  and  dynamic  scheduling  problem  as  a  polyhedron.  Then,  the  stochastic  and 
dynamic  scheduling  problem  is  translated  to  an  optimization  problem  over  the  corresponding 
polyhedron,  which  can  then  be  attacked  by  traditional  mathematical  programming  methods. 
Coffman  and  Mitrani  [7]  and  Gelenbe  and  Mitrani  [13]  first  showed  using  conservation  laws 
that  the  performance  space  of  a  multiclass  queue  under  the  average  cost  criterion  can  be 
described as a polyhedron. Federgruen and Groenevelt [11], [12] advanced the theory further
by observing that in certain special cases of multiclass queues, the polyhedron has a very
special structure (it is a polymatroid) that gives rise to very simple optimal policies (the $c\mu$
rule). Shantikumar and Yao [26] generalized the theory further by observing that if a system
satisfies strong conservation laws, then the underlying performance space is necessarily a
polymatroid. They also proved that, when the cost is linear on the performance, the optimal
policy is a fixed priority rule (also called head of the line priority rule; see Cobham [6], and
Cox and Smith [9]). Their results partially extend to some rather restricted queueing
networks, in which they assume that all the different classes of customers have the same
routing probabilities, and the same service requirements at each station of the network (see
also [25]). Tsoucas [32] derived the region of achievable performance in the problem of
scheduling a multiclass nonpreemptive M/G/1 queue with Bernoulli feedback, introduced
by Klimov [22]. Finally, Bertsimas et al. [2] generalize the ideas of conservation laws
to general multiclass queueing networks using potential function ideas. They find linear
and  nonlinear  inequalities  that  the  feasible  region  satisfies.  Optimization  over  this  set  of 
constraints  gives  bounds  on  achievable  performance. 

Our  goal  in  this  paper  is  to  propose  a  unified  theory  of  conservation  laws  and  to  establish 
that  the  very  strong  structural  properties  in  the  optimization  of  a  class  of  stochastic  and 
dynamic  systems  that  include  the  multi-armed  bandit  problem  and  its  extensions  follow  from 
the corresponding strong structural properties of the underlying polyhedra that characterize
the  regions  of  achievable  performance. 

By generalizing the work of Shantikumar and Yao [26] we show that if performance
measures in stochastic and dynamic scheduling problems satisfy generalized conservation
laws, then the feasible space of achievable performance is a polyhedron called an extended
polymatroid (see Bhattacharya et al. [4]). Optimization of a linear objective over an
extended polymatroid is solved by an adaptive greedy algorithm, which leads to an optimal
solution having an indexability property. Special cases of our theory include all the
problems we have mentioned, i.e., multi-armed bandit problems, discounted and undiscounted
branching bandits, multiclass queues, multiclass queues with feedback and deterministic
scheduling problems. Interesting consequences of our results include:

1.  A  characterization  of  indexable  systems  as  systems  that  satisfy  generalized  conserva- 
tion laws. 

2.  Sufficient  conditions  for  indexable  systems  to  be  decomposable. 

3.  A  genuinely  new,  algebraic  proof  (based  on  the  strong  duality  theory  of  linear  pro- 
gramming as  opposed  to  dynamic  programming  formulations)  of  the  decomposability 
property  of  Gittins  indices  in  multi-armed  bandit  problems. 


4.  A unified and practical approach to sensitivity analysis of indexable systems, based
on the well understood sensitivity analysis of linear programming.

5.  A  new  characterization  of  the  indices  of  indexable  systems  as  sums  of  dual  variables 
corresponding  to  the  extended  polymatroid  that  characterizes  the  feasible  space. 

6.  A new interpretation of indices in the context of branching bandits as retirement
options, thus generalizing the interpretation of Whittle [36] and Weber [34] for the
indices of the classical multi-armed bandit problem.

7.  The  first  complete  and  rigorous  analysis  of  the  indexability  of  undiscounted  branching 
bandits. 

8.  A new algorithm to compute the indices of indexable systems (in particular Gittins
indices), which is as fast as the fastest known algorithm (Varaiya, Walrand and
Buyukkoc [33]).

9.  The realization that the algorithm of Klimov for multiclass queues and the algorithm
of  Gittins  for  multi-armed  bandits  are  examples  of  the  same  algorithm. 

10.  Closed  form  formulae  for  the  performance  of  the  optimal  policy.  This  also  leads  to  an 
understanding  of  the  nondependence  of  the  indices  on  some  of  the  parameters  of  the 
stochastic  scheduling  problem. 

Most importantly, our approach provides a unified treatment of several classical
problems in stochastic and dynamic scheduling and is able to address in a unified way their
variations such as: discounted versus undiscounted cost criterion, rewards versus taxes,
preemption versus nonpreemption, discrete versus continuous time, work conserving versus
idling policies, linear versus nonlinear objective functions.

The  paper  is  structured  as  follows:  In  Section  2  we  define  the  notion  of  generalized  con- 
servation laws  and  show  that  if  a  performance  vector  of  a  stochastic  and  dynamic  scheduling 
problem  satisfies  generalized  conservation  laws,  then  the  feasible  space  of  this  performance 
vector  is  an  extended  polymatroid.  Using  the  duality  theory  of  linear  programming  we 
show  that  linear  optimization  problems  over  extended  polymatroids  can  be  solved  by  an 
adaptive  greedy  algorithm.  Most  importantly,  we  show  that  this  optimization  problem  has 
an indexability property. In this way, we give a characterization of indexable systems as
systems that satisfy generalized conservation laws. We also find a sufficient condition for
an indexable system to be decomposable and prove a powerful result on sensitivity analysis.
In Section 3 we study a natural generalization of the classical multi-armed bandit problem:
the branching bandit problem. We propose two different performance measures and prove
that they satisfy generalized conservation laws, and thus from the results of the previous
section their feasible space is an extended polymatroid. We then consider different cost and
reward structures on branching bandits, corresponding to the discounted and undiscounted
case, and some transform results. Section 4 contains applications of the previous sections to
various classical problems: multi-armed bandits, multi-class queueing scheduling problems
with  or  without  feedback  and  deterministic  scheduling  problems.  The  final  section  contains 
some  thoughts  on  the  field  of  optimization  of  stochastic  systems. 

2      Extended  Polymatroids  and  Generalized  Conservation  Laws 

2.1     Extended  Polymatroids 

Tsoucas [32] characterized the performance space of Klimov's problem (see Klimov [22]) as a
polyhedron with a special structure, not previously identified in the literature. Bhattacharya
et al. [4] called this polyhedron an extended polymatroid and proved some interesting
properties of it. Extended polymatroids are a central structure for the results we present in this
paper.

Let us first establish the notation we will use. Let $E = \{1, \ldots, n\}$ be a finite set. Let $x$
denote a real $n$-vector, with components $x_i$, for $i \in E$. For $S \subseteq E$, let $S^c = E \setminus S$, and let
$|S|$ denote the cardinality of $S$. Let $2^E$ denote the class of all subsets of $E$. Let $b : 2^E \to \Re_+$
be a set function that satisfies $b(\emptyset) = 0$. Let $A = (A_i^S)_{i \in E,\, S \subseteq E}$ be a matrix that satisfies

$$A_i^S > 0, \text{ for } i \in S \quad \text{and} \quad A_i^S = 0, \text{ for } i \in S^c, \qquad \text{for all } S \subseteq E. \tag{1}$$

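Let $\pi = (\pi_1, \ldots, \pi_n)$ be a permutation of $E$. For clarity of presentation, it is convenient
to introduce the following additional notation. For an $n$-vector $x = (x_1, \ldots, x_n)^T$ let
$x_\pi = (x_{\pi_1}, \ldots, x_{\pi_n})^T$. Let us write

$$b_\pi = \bigl(b(\{\pi_1\}),\ b(\{\pi_1, \pi_2\}),\ \ldots,\ b(\{\pi_1, \ldots, \pi_n\})\bigr)^T.$$

Let $A_\pi$ denote the following lower triangular submatrix of $A$:

$$A_\pi = \begin{pmatrix}
A_{\pi_1}^{\{\pi_1\}} & 0 & \cdots & 0 \\
A_{\pi_1}^{\{\pi_1, \pi_2\}} & A_{\pi_2}^{\{\pi_1, \pi_2\}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
A_{\pi_1}^{\{\pi_1, \ldots, \pi_n\}} & A_{\pi_2}^{\{\pi_1, \ldots, \pi_n\}} & \cdots & A_{\pi_n}^{\{\pi_1, \ldots, \pi_n\}}
\end{pmatrix}.$$

Let $v(\pi)$ be the unique solution of the linear system

$$\sum_{i=1}^{j} A_{\pi_i}^{\{\pi_1, \ldots, \pi_j\}} x_{\pi_i} = b(\{\pi_1, \ldots, \pi_j\}), \qquad j = 1, \ldots, n, \tag{2}$$

or, in matrix notation:

$$A_\pi x_\pi = b_\pi. \tag{3}$$

Let us define the polyhedron

$$\mathcal{P}(A, b) = \Bigl\{ x \in \Re^n : \sum_{i \in S} A_i^S x_i \geq b(S), \text{ for } S \subseteq E \Bigr\} \tag{4}$$

and the polytope

$$\mathcal{B}(A, b) = \Bigl\{ x \in \Re^n : \sum_{i \in S} A_i^S x_i \geq b(S), \text{ for } S \subset E, \text{ and } \sum_{i \in E} A_i^E x_i = b(E) \Bigr\}. \tag{5}$$

Note that if $x \in \mathcal{P}(A, b)$, then it follows that $x \geq 0$ componentwise. The following definition
is due to Bhattacharya et al. [4].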

Definition 3 (Extended Polymatroid) We say that the polyhedron $\mathcal{P}(A, b)$ is an
extended polymatroid with base set $E$, if for every permutation $\pi$ of $E$, $v(\pi) \in \mathcal{P}(A, b)$. In
this case we say that the polytope $\mathcal{B}(A, b)$ is the base of the extended polymatroid $\mathcal{P}(A, b)$.
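
To make Definition 3 concrete, the following minimal sketch checks the defining property by brute force on a small instance. It is an illustration under stated assumptions, not part of the theory: job types are labelled 0, ..., n-1, the matrix $A$ is passed as a hypothetical callable `A(i, S)`, the set function as a hypothetical callable `b(S)`, and the helper names `vertex`, `in_polyhedron` and `is_extended_polymatroid` are ours. The example instance ($A_i^S = 1$, $b(S) = |S|^2$) is likewise an assumption chosen for simplicity.

```python
from fractions import Fraction
from itertools import combinations, permutations


def vertex(pi, A, b):
    """Solve the triangular system (2) for v(pi) by forward substitution."""
    x = {}
    for j in range(len(pi)):
        S_j = frozenset(pi[: j + 1])          # S_j = {pi_1, ..., pi_j}
        partial = sum(A(pi[i], S_j) * x[pi[i]] for i in range(j))
        x[pi[j]] = Fraction(b(S_j) - partial) / A(pi[j], S_j)
    return x


def in_polyhedron(x, E, A, b):
    """Check sum_{i in S} A(i, S) x_i >= b(S) for every nonempty S of E."""
    for size in range(1, len(E) + 1):
        for S in map(frozenset, combinations(E, size)):
            if sum(A(i, S) * x[i] for i in S) < b(S):
                return False
    return True


def is_extended_polymatroid(E, A, b):
    """Definition 3: v(pi) must lie in P(A, b) for every permutation pi."""
    return all(in_polyhedron(vertex(pi, A, b), E, A, b)
               for pi in permutations(E))


# Assumed toy instance: unit coefficients and b(S) = |S|^2.
E = [0, 1, 2]
ones = lambda i, S: 1
square = lambda S: len(S) ** 2
print(is_extended_polymatroid(E, ones, square))
```

The check enumerates all $n!$ permutations and all $2^n - 1$ nonempty subsets, so it is only practical for very small $n$; exact rational arithmetic is used so that tight constraints are not misjudged by rounding.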


2.2     Optimization  over  Extended  Polymatroids 

Extended  polymatroids  are  polyhedra  defined  by  an  exponential  number  of  inequalities. 
Yet,  Tsoucas  [32]  and  Bhattacharya  et  al.  [4]  presented  a  polynomial  algorithm,  based 
on  Klimov's  algorithm  (see  Klimov  [22])  for  solving  a  linear  programming  problem  over 
an  extended  polymatroid.  In  this  subsection  we  provide  a  new  duality  proof  that  this 
algorithm solves the problem optimally. We then show that we can associate with this
linear program certain indices, related to the dual program, in such a way that the problem
has an indexability property. Under certain conditions, we prove that a stronger index
decomposition property holds. We also present an optimality condition specially suited for
performing sensitivity analysis.

In what follows we assume that $\mathcal{P}(A, b)$ is an extended polymatroid. Let $R \in \Re^n$ be a
row vector. Let us consider the following linear programming problem:

$$(P) \qquad \max\Bigl\{ \sum_{i \in E} R_i x_i : x \in \mathcal{B}(A, b) \Bigr\}. \tag{6}$$

Note that since $\mathcal{B}(A, b)$ is a polytope, this linear program has a finite optimal solution.
Therefore we may consider its dual, and this will have the same optimum value. We shall
have a dual variable $y^S$ for every $S \subseteq E$. The dual problem is:

$$(D) \qquad \min\Bigl\{ \sum_{S \subseteq E} b(S)\, y^S : \sum_{S \ni i} A_i^S y^S = R_i, \text{ for } i \in E, \text{ and } y^S \leq 0, \text{ for } S \subset E \Bigr\}. \tag{7}$$

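In order to solve (P), Bhattacharya et al. [4] presented the following adaptive greedy
algorithm, based on Klimov's algorithm [22]:

Algorithm $\mathcal{A}_1$

Input: $(R, A)$.

Output: $(\pi, y, \nu, \mathcal{S})$, where $\pi = (\pi_1, \ldots, \pi_n)$ is a permutation of $E$, $y = (y^S)_{S \subseteq E}$,
$\nu = (\nu_1, \ldots, \nu_n)$, and $\mathcal{S} = \{S_1, \ldots, S_n\}$, with $S_k = \{\pi_1, \ldots, \pi_k\}$, for $k \in E$.

Step 0. Set $S_n = E$. Set $\nu_n = \max\{ R_i / A_i^E : i \in E \}$;
pick $\pi_n \in \operatorname{argmax}\{ R_i / A_i^E : i \in E \}$.

Step k. For $k = 1, \ldots, n-1$:

Set $S_{n-k} = S_{n-k+1} \setminus \{\pi_{n-k+1}\}$; set

$$\nu_{n-k} = \max\Bigl\{ \frac{R_i - \sum_{j=0}^{k-1} A_i^{S_{n-j}} \nu_{n-j}}{A_i^{S_{n-k}}} : i \in S_{n-k} \Bigr\};$$

pick

$$\pi_{n-k} \in \operatorname{argmax}\Bigl\{ \frac{R_i - \sum_{j=0}^{k-1} A_i^{S_{n-j}} \nu_{n-j}}{A_i^{S_{n-k}}} : i \in S_{n-k} \Bigr\}.$$

Step n. For $S \subseteq E$ set

$$y^S = \begin{cases} \nu_j, & \text{if } S = S_j \text{ for some } j \in E; \\ 0, & \text{otherwise.} \end{cases}$$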

It is easy to see that the complexity of $\mathcal{A}_1$, given $(R, A)$, is $O(n^3)$. Note that, for certain
reward vectors, ties may occur in algorithm $\mathcal{A}_1$. In the presence of ties, the permutation
$\pi$ generated clearly depends on the choice of tie-breaking rules. However, we will show
that vectors $\nu$ and $y$ are uniquely determined by $\mathcal{A}_1$. In order to prove this point, whose
importance will be clear later, and to understand $\mathcal{A}_1$ better, let us introduce the following
related algorithm:

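Algorithm $\mathcal{A}_2$

Input: $(R, A)$.

Output: $(r, \bar{y}, \mathcal{H}, \mathcal{J})$, where $1 \leq r \leq n$ is an integer, $\bar{y} = (\bar{y}^S)_{S \subseteq E}$, $\mathcal{H} = \{H_1, \ldots, H_r\}$ is a
partition of $E$, and $J_k = \bigcup_{l=k}^{r} H_l$, for $k = 1, \ldots, r$.

Step 1. Set $k := 1$; set $J_1 = E$;

set $\theta_1 = \max\{ R_i / A_i^E : i \in E \}$ and $H_1 = \operatorname{argmax}\{ R_i / A_i^E : i \in E \}$.

Step 2. While $J_k \neq H_k$ do:

begin

Set $k := k + 1$; set $J_k = J_{k-1} \setminus H_{k-1}$;

set

$$\theta_k = \max\Bigl\{ \frac{R_i - \sum_{l=1}^{k-1} A_i^{J_l} \theta_l}{A_i^{J_k}} : i \in J_k \Bigr\} \quad \text{and} \quad H_k = \operatorname{argmax}\Bigl\{ \frac{R_i - \sum_{l=1}^{k-1} A_i^{J_l} \theta_l}{A_i^{J_k}} : i \in J_k \Bigr\}.$$

end {while}

Step 3. Set $r = k$;

for $S \subseteq E$ set

$$\bar{y}^S = \begin{cases} \theta_k, & \text{if } S = J_k \text{ for some } k = 1, \ldots, r; \\ 0, & \text{otherwise.} \end{cases}$$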
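
The tie-grouping variant can be sketched in the same style. Again this is only an illustration under our own assumptions: labels 0, ..., n-1, a hypothetical callable `A(i, S)`, exact rational arithmetic so that ties are detected reliably, and an illustrative function name. The sketch also returns the list of values $\theta_1, \ldots, \theta_r$ for convenience, which the stated output of the algorithm carries only through $\bar{y}$.

```python
from fractions import Fraction


def adaptive_greedy_grouped(R, A):
    """Sketch of the tie-grouping algorithm: each step removes the whole argmax set.

    Returns (r, y_bar, H_list, J_list, theta), where H_list[k] and J_list[k]
    hold H_{k+1} and J_{k+1}, and y_bar maps J_k to theta_k.
    """
    J = frozenset(range(len(R)))        # J_1 = E
    theta, H_list, J_list = [], [], []
    while J:                            # stops once J_k = H_k has been removed
        def ratio(i):
            used = sum(A(i, J_l) * t for J_l, t in zip(J_list, theta))
            return Fraction(R[i] - used) / A(i, J)
        values = {i: ratio(i) for i in J}
        top = max(values.values())
        H = frozenset(i for i in J if values[i] == top)   # all maximizers
        theta.append(top)
        J_list.append(J)
        H_list.append(H)
        J = J - H
    y_bar = dict(zip(J_list, theta))
    return len(theta), y_bar, H_list, J_list, theta


# Assumed toy instance: a tie between job types 0 and 1.
r, y_bar, H_list, J_list, theta = adaptive_greedy_grouped([3, 3, 1], lambda i, S: 1)
print(r, H_list, theta)
```

Because whole argmax sets are removed at once, no tie-breaking rule is needed and the output depends only on the input.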

In what follows let $(\pi, y, \nu, \mathcal{S})$ be an output of $\mathcal{A}_1$ and let $(r, \bar{y}, \mathcal{H}, \mathcal{J})$ be the output of
$\mathcal{A}_2$. Note that the output of algorithm $\mathcal{A}_2$ is uniquely determined by its input.

The idea that algorithm $\mathcal{A}_2$ is just an unambiguous version of $\mathcal{A}_1$ is formalized in the
following result:

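Proposition 1 The following relations hold between the outputs of algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$:

(a) for $j = 1, \ldots, n$,

$$\nu_j = \begin{cases} \theta_k, & \text{if } j = |J_k| \text{ for some } k = 1, \ldots, r; \\ 0, & \text{otherwise;} \end{cases} \tag{8}$$

(b) $y = \bar{y}$;

(c) $\pi$ satisfies

$$J_k = \{\pi_1, \ldots, \pi_{|J_k|}\}, \qquad k = 1, \ldots, r, \tag{9}$$

$$H_k = \{\pi_{|J_k| - |H_k| + 1}, \ldots, \pi_{|J_k|}\}, \qquad k = 1, \ldots, r. \tag{10}$$

Outline of the proof

Parts (a) and (c) follow by induction arguments. Part (b) follows by (a) and the definitions
of $y$ and $\bar{y}$. □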

Remark: Proposition 1 shows that $y$ and $\nu$ are uniquely determined (and thus invariant
under different tie-breaking rules) by algorithm $\mathcal{A}_1$. It also reveals in (c) the structure of
the permutations $\pi$ that can be generated by $\mathcal{A}_1$.

Tsoucas [32] and Bhattacharya et al. [4] proved from first principles that algorithm $\mathcal{A}_1$
solves linear program (P) optimally. Next we provide a new proof, using linear programming
duality theory.

Proposition 2 Let vector $y$ and permutation $\pi$ be generated by algorithm $\mathcal{A}_1$. Then $v(\pi)$
and $y$ are an optimal primal-dual pair for the linear programs (P) and (D).

Proof

We first show that $y$ is dual feasible. By definition of $\nu_n$ in $\mathcal{A}_1$, it follows that

$$R_i - A_i^E \nu_n \leq 0, \qquad i \in E,$$

and since $S_{n-1} \subset S_n$ it follows that $\nu_{n-1} \leq 0$.

Similarly, for $k = 1, \ldots, n-2$, by definition of $\nu_{n-k}$ it follows that

$$R_i - \sum_{j=0}^{k} A_i^{S_{n-j}} \nu_{n-j} \leq 0, \qquad i \in S_{n-k},$$

and since $S_{n-k-1} \subset S_{n-k}$, it follows that $\nu_{n-k-1} \leq 0$. Hence $\nu_j \leq 0$, for $j = 1, \ldots, n-1$,
and by definition of $y$, we have $y^S \leq 0$, for $S \subset E$.

Moreover, for $k = 0, 1, \ldots, n-1$ we have, by construction,

$$\sum_{S \ni \pi_{n-k}} A_{\pi_{n-k}}^{S} y^S = \sum_{j=0}^{k} A_{\pi_{n-k}}^{S_{n-j}} \nu_{n-j} = R_{\pi_{n-k}}.$$

Hence $y$ is dual feasible.

Let $x = v(\pi)$. Since $\mathcal{P}(A, b)$ is an extended polymatroid, $x$ is primal feasible. Let us
show that $x$ and $y$ satisfy complementary slackness. Assume $y^S \neq 0$. Then, by construction
it must be $S = S_k = \{\pi_1, \ldots, \pi_k\}$, for some $k$. And since $x$ satisfies (3), it follows that

$$\sum_{i \in S} A_i^S x_i = \sum_{j=1}^{k} A_{\pi_j}^{\{\pi_1, \ldots, \pi_k\}} x_{\pi_j} = b(S).$$

Hence, by strong duality $v(\pi)$ and $y$ are an optimal primal-dual pair, and this completes
the proof. □
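
The optimality statement of Proposition 2 can be checked numerically on a small instance. The following is a minimal sketch under assumptions of our own: labels 0, ..., n-1, hypothetical callables `A(i, S)` and `b(S)`, illustrative function names, and a toy extended polymatroid with $A_i^S = 1$ and $b(S) = |S|^2$. It compares the primal value at $v(\pi)$, the dual value of $y$, and the best value over all vertices $v(\sigma)$ found by enumeration.

```python
from fractions import Fraction
from itertools import permutations


def adaptive_greedy(R, A):
    """Compact adaptive greedy sketch: returns (pi, nu) in 0-based list form."""
    n, S, chain = len(R), frozenset(range(len(R))), []
    pi, nu = [None] * n, [None] * n
    for pos in range(n - 1, -1, -1):
        def ratio(i):
            return Fraction(R[i] - sum(A(i, T) * v for T, v in chain)) / A(i, S)
        best = max(sorted(S), key=ratio)
        pi[pos], nu[pos] = best, ratio(best)
        chain.append((S, nu[pos]))
        S = S - {best}
    return pi, nu


def vertex(pi, A, b):
    """Forward substitution for the triangular system defining v(pi)."""
    x = {}
    for j in range(len(pi)):
        S_j = frozenset(pi[: j + 1])
        partial = sum(A(pi[i], S_j) * x[pi[i]] for i in range(j))
        x[pi[j]] = Fraction(b(S_j) - partial) / A(pi[j], S_j)
    return x


def value(R, x):
    return sum(R[i] * x[i] for i in range(len(R)))


# Assumed toy instance.
R = [3, 1, 2]
A = lambda i, S: 1
b = lambda S: len(S) ** 2

pi, nu = adaptive_greedy(R, A)
primal = value(R, vertex(pi, A, b))
dual = sum(nu[j] * b(frozenset(pi[: j + 1])) for j in range(len(R)))
brute = max(value(R, vertex(sigma, A, b)) for sigma in permutations(range(len(R))))
print(primal, dual, brute)
```

Equality of the three numbers is what strong duality predicts for this instance; the enumeration over permutations is included only as an independent cross-check and does not scale.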

Remark: Edmonds [10] introduced a special class of polyhedra called polymatroids, and
proved the classical result that the greedy algorithm solves the linear optimization problem
over a polyhedron for every linear objective function if and only if the polyhedron is a
polymatroid. Now, in the case that $A_i^S = 1$, for $i \in S$ and $S \subseteq E$, it is easy to see
that $\mathcal{A}_1$ is the greedy algorithm that sorts the $R_i$'s in nonincreasing order. By Edmonds'
result and Proposition 2 it follows that in this case $\mathcal{B}(A, b)$ is a polymatroid. Therefore,
extended polymatroids are the natural generalizations of polymatroids, and algorithm $\mathcal{A}_1$
is the natural extension of the greedy algorithm.

The fact that $v(\pi)$ and $y$ are optimal solutions has some important consequences. It is
well known that every extreme point of a polyhedron is the unique maximizer of some linear
objective function. Therefore, the $v(\pi)$'s are the only extreme points of $\mathcal{P}(A, b)$. Hence it
follows:

Theorem 1 (Characterization of Extreme Points) The set of extreme points of $\mathcal{P}(A, b)$
is

$$\{ v(\pi) : \pi \text{ is a permutation of } E \}.$$

The optimality of the adaptive greedy algorithm $\mathcal{A}_1$ leads naturally to the definition
of certain indices, which for historical reasons, that will be clear later, we call generalized
Gittins indices.

Definition 4 (Generalized Gittins Indices) Let $y$ be the optimal dual solution
generated by algorithm $\mathcal{A}_1$. Let

$$\gamma_i = \sum_{S :\, E \supseteq S \ni i} y^S, \qquad i \in E. \tag{11}$$

We say that $\gamma_1, \ldots, \gamma_n$ are the generalized Gittins indices of linear program (P).
Remark: Notice that by Proposition 1(a) and the definition of $y$, it follows that if
permutation $\pi$ is an output of algorithm $\mathcal{A}_1$ then the generalized Gittins indices can be computed
as follows:

$$\gamma_{\pi_i} = \nu_i + \nu_{i+1} + \cdots + \nu_n, \qquad i \in E. \tag{12}$$

Let $\Re_- = \{ x \in \Re : x \leq 0 \}$. Let $\gamma_1, \ldots, \gamma_n$ be the generalized Gittins indices of (P). Let
$\pi$ be a permutation of $E$. Let $T$ be the following $n \times n$ lower triangular matrix:

$$T = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
1 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1
\end{pmatrix}. \tag{13}$$
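
The passage from the vector $\nu$ to the generalized Gittins indices is a suffix sum, which in matrix form is the product of the row vector $\nu$ with the lower triangular matrix $T$ of ones. A minimal sketch, assuming 0-based Python lists for $\pi$ and $\nu$ (position `j` holds $\pi_{j+1}$ and $\nu_{j+1}$) and hypothetical function names of our own:

```python
from itertools import accumulate


def gittins_from_nu(pi, nu):
    """gamma_{pi_i} = nu_i + ... + nu_n, computed as a suffix sum."""
    suffix = list(accumulate(reversed(nu)))[::-1]
    return dict(zip(pi, suffix))


def gittins_via_T(pi, nu):
    """Same indices through the explicit product nu * T, T lower triangular of ones."""
    n = len(nu)
    T = [[1 if col <= row else 0 for col in range(n)] for row in range(n)]
    gamma_pi = [sum(nu[row] * T[row][col] for row in range(n)) for col in range(n)]
    return dict(zip(pi, gamma_pi))


# Assumed example values: pi and nu as produced for R = (3, 1, 2), unit coefficients.
pi, nu = [1, 2, 0], [-1, -1, 3]
print(gittins_from_nu(pi, nu), gittins_via_T(pi, nu))
```

The suffix-sum version runs in linear time; the explicit matrix product is shown only to mirror the matrix notation.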

In  the  next  proposition  and  the  next  theorem  we  reveal  the  equivalence  between  some 
optimality  conditions  for  linear  program  (P). 

Proposition 3 The following statements are equivalent:

(a) $\pi$ satisfies (9) and (10);

(b) $\pi$ is an output of algorithm $\mathcal{A}_1$;

(c) $R_\pi A_\pi^{-1} \in \Re_-^{n-1} \times \Re$, and then the generalized Gittins indices are given by $\gamma_\pi = R_\pi A_\pi^{-1} T$;

(d) $\gamma_{\pi_1} \leq \gamma_{\pi_2} \leq \cdots \leq \gamma_{\pi_n}$.

Outline of the proof

(a) $\Rightarrow$ (b): Proved in Proposition 1(a).

(b) $\Rightarrow$ (c): It is clear, by construction in $\mathcal{A}_1$, that

$$\nu = R_\pi A_\pi^{-1}. \tag{14}$$

Now, in the proof of Proposition 2 we showed that $\nu \in \Re_-^{n-1} \times \Re$. Moreover, by (12) we get

$$\gamma_\pi = \nu T,$$

and by (14) it follows that

$$\gamma_\pi = R_\pi A_\pi^{-1} T.$$

(c) $\Rightarrow$ (d): By (c) we have

$$(\gamma_{\pi_1} - \gamma_{\pi_2}, \ldots, \gamma_{\pi_{n-1}} - \gamma_{\pi_n}, \gamma_{\pi_n}) = \gamma_\pi T^{-1} = R_\pi A_\pi^{-1} \in \Re_-^{n-1} \times \Re,$$

whence the result follows.

(d) $\Rightarrow$ (a): By construction of $\mathcal{J}$ in algorithm $\mathcal{A}_2$, the fact that $y = \bar{y}$ and the definition of
the generalized Gittins indices, it follows that

$$\gamma_i = \theta_1 + \cdots + \theta_h, \qquad \text{for } i \in H_h, \text{ and } h = 1, \ldots, r. \tag{15}$$

Also, it is easy to see that $\theta_j < 0$, for $j \geq 2$. These two facts clearly imply that $\pi$ must
satisfy (10), and hence (9), which completes the proof of the proposition. □

Combining the result that algorithm $\mathcal{A}_1$ solves linear program (P) optimally with the
equivalent conditions in Proposition 3, we obtain several optimality conditions, as shown
next.

Theorem 2 (Sufficient Optimality Conditions and Indexability) Assume that any
of the conditions (a)-(d) of Proposition 3 holds. Then $v(\pi)$ solves linear program (P)
optimally.

It is easy to see that conditions (a)-(d) of Proposition 3 are not, in general, necessary
optimality conditions. They are necessary if the polytope $\mathcal{B}(A, b)$ is nondegenerate. Some
consequences of Theorem 2 are the following:
Remarks: 

1. Sensitivity analysis: Optimality condition (c) of Proposition 3 is especially well suited for performing sensitivity analysis. Consider the following question: given a permutation $\pi$ of $E$, for what vectors $R$ and matrices $A$ can we guarantee that $v(\pi)$ solves problem (P) optimally? The answer is: for $R$ and $A$ that satisfy the condition

$$R_\pi A_\pi^{-1} \in \Re_-^{n-1} \times \Re.$$

We may also ask: for which permutations $\pi$ can we guarantee that $v(\pi)$ is optimal? By Proposition 3(d), the answer now is: for permutations $\pi$ that satisfy

$$\gamma_{\pi_1} \le \gamma_{\pi_2} \le \cdots \le \gamma_{\pi_n},$$

thus providing an $O(n \log n)$ optimality test for $\pi$. Glazebrook [17] addressed the problem of sensitivity analysis in stochastic scheduling problems. His results are in the form of suboptimality bounds.

2. Explicit formulae for Gittins indices: Proposition 3(c) provides an explicit formula for the vector of generalized Gittins indices. The formula reveals that the indices are piecewise linear functions of the reward vector.

3. Indexability: Optimality condition (d) of Proposition 3 shows that any permutation that sorts the generalized Gittins indices in nonincreasing order provides an optimal solution for problem (P). Condition (d) thus shows that this class of optimization problems has an indexability property.
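The optimality test of Remark 1 can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it assumes that the matrix is available as a hypothetical callable `A(i, S)` returning $A_i^S$, solves the triangular system $\bar y A_\pi = R_\pi$ by back substitution, and forms the indices as the partial sums $\gamma_{\pi_i} = \bar y_i + \cdots + \bar y_n$. The function names and the data layout are assumptions made for the example.

```python
from fractions import Fraction


def dual_and_indices(R, A, pi):
    """Solve y * A_pi = R_pi by back substitution and form the indices.

    R  : dict mapping element -> reward R_i
    A  : callable (i, S) -> A_i^S for a frozenset S containing i (assumed layout)
    pi : tuple (pi_1, ..., pi_n); pi_n is the highest-priority element
    """
    n = len(pi)
    nested = [frozenset(pi[: j + 1]) for j in range(n)]  # S_j = {pi_1, ..., pi_j}
    y = [Fraction(0)] * n
    for i in reversed(range(n)):
        # Constraint of element pi_i: sum_{j >= i} A(pi_i, S_j) * y_j = R(pi_i)
        tail = sum(A(pi[i], nested[j]) * y[j] for j in range(i + 1, n))
        y[i] = (Fraction(R[pi[i]]) - tail) / A(pi[i], nested[i])
    # gamma_{pi_i} = y_i + ... + y_n
    gamma = {}
    running = Fraction(0)
    for i in reversed(range(n)):
        running += y[i]
        gamma[pi[i]] = running
    return y, gamma


def is_optimal_permutation(R, A, pi):
    """Sufficient test: all dual variables except the last are nonpositive."""
    y, _ = dual_and_indices(R, A, pi)
    return all(value <= 0 for value in y[:-1])


# Usual polymatroid case (all weights equal to 1): the indices are the rewards.
rewards = {0: 3, 1: 1, 2: 2}
unit_weights = lambda i, S: 1
print(is_optimal_permutation(rewards, unit_weights, (1, 2, 0)))
print(is_optimal_permutation(rewards, unit_weights, (0, 1, 2)))
```

With unit weights the test reduces to checking that the rewards are sorted in nondecreasing order along $\pi$, which is the familiar greedy rule for polymatroids.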

In the case that matrix $A$ has a certain special structure, the computation of the indices of (P) can be simplified. Let $E$ be partitioned as $E = \bigcup_{k=1}^{K} E_k$. For $k = 1, \ldots, K$, let $B(A^k, b^k)$ be the base of an extended polymatroid; let $x^k = (x_i^k)_{i \in E_k}$; let $(P_k)$ be the following linear program:

$$(P_k) \quad \max\Bigl\{ \sum_{i \in E_k} R_i x_i^k : x^k \in B(A^k, b^k) \Bigr\}; \qquad (16)$$

let $\{\gamma_i^k\}_{i \in E_k}$ be the generalized Gittins indices of problem $(P_k)$. Assume that the following independence condition holds:

$$A_i^S = A_i^{S \cap E_k} = (A^k)_i^{S \cap E_k}, \quad \text{for } i \in S \cap E_k \text{ and } S \subseteq E. \qquad (17)$$

Under condition (17) there is an easy relation between the indices of problems (P) and $(P_k)$, as shown in the next result.

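Theorem 3 (Index Decomposition) Under condition (17), the generalized Gittins indices of linear programs (P) and $(P_k)$ satisfy

$$\gamma_i = \gamma_i^k, \quad \text{for } i \in E_k \text{ and } k = 1, \ldots, K. \qquad (18)$$

Proof

Let

$$h_i = \gamma_i^k, \quad \text{for } i \in E_k \text{ and } k = 1, \ldots, K. \qquad (19)$$

Let us renumber the elements of $E$ so that

$$h_1 \le h_2 \le \cdots \le h_n. \qquad (20)$$

Let $\pi = (1, \ldots, n)$. Permutation $\pi$ of $E$ induces permutations $\pi^k$ of $E_k$, for $k = 1, \ldots, K$, that satisfy

$$\gamma^k_{\pi^k_1} \le \gamma^k_{\pi^k_2} \le \cdots \le \gamma^k_{\pi^k_{|E_k|}}. \qquad (21)$$

Hence, by Proposition 3 it follows that

$$\gamma^k_{\pi^k} = R_{\pi^k} \bigl(A^k_{\pi^k}\bigr)^{-1} T_k, \quad \text{for } k = 1, \ldots, K,$$

or, equivalently,

$$\bigl(\gamma^1_{\pi^1}, \ldots, \gamma^K_{\pi^K}\bigr) \begin{pmatrix} T_1^{-1} A^1_{\pi^1} & 0 & \cdots & 0 \\ 0 & T_2^{-1} A^2_{\pi^2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & T_K^{-1} A^K_{\pi^K} \end{pmatrix} = \bigl(R_{\pi^1}, R_{\pi^2}, \ldots, R_{\pi^K}\bigr), \qquad (22)$$

where $T_k$ is an $|E_k| \times |E_k|$ matrix with the same structure as matrix $T$, for $k = 1, \ldots, K$. On the other hand, we have

$$T^{-1} A_\pi = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{pmatrix} A_\pi = \begin{pmatrix} A_1^{\{1\}} & 0 & \cdots & 0 \\ A_1^{\{1,2\}} - A_1^{\{1\}} & A_2^{\{1,2\}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ A_1^{\{1,\ldots,n\}} - A_1^{\{1,\ldots,n-1\}} & \cdots & A_{n-1}^{\{1,\ldots,n\}} - A_{n-1}^{\{1,\ldots,n-1\}} & A_n^{\{1,\ldots,n\}} \end{pmatrix}.$$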

Now, notice that if $i \in E_k$, $j \in E \setminus E_k$ and $i < j$ then, by (17):

$$A_i^{\{1,\ldots,j\}} = A_i^{\{1,\ldots,j\} \cap E_k} = A_i^{\{1,\ldots,j-1\} \cap E_k} = A_i^{\{1,\ldots,j-1\}}. \qquad (23)$$

Hence, by (19) and (23) it follows that system (22) can be written equivalently as

$$h T^{-1} A_\pi = R_\pi. \qquad (24)$$

Now, (20) and (24) imply that

$$R_\pi A_\pi^{-1} = h T^{-1} = (h_1 - h_2, \ldots, h_{n-1} - h_n, h_n) \in \Re_-^{n-1} \times \Re, \qquad (25)$$

and by Proposition 3 it follows that the generalized Gittins indices of problem (P) satisfy

$$\gamma_\pi = R_\pi A_\pi^{-1} T.$$

Hence, by (24),

$$h_i = \gamma_i, \quad \text{for } i \in E,$$

and this completes the proof of the theorem. □

Theorem 3 implies that the fundamental reason for decomposition to hold is (17). An easy and useful consequence of Theorems 2 and 3 is the following:

Corollary 1 Under the assumptions of Theorem 3, an optimal solution of problem (P) can be computed by solving the $K$ subproblems $(P_k)$, for $k = 1, \ldots, K$, by algorithm A1 and computing their respective generalized Gittins indices.
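Corollary 1 can be illustrated numerically. The sketch below assumes that algorithm A1 has the adaptive greedy form in which, at each step, the element maximizing $(R_i - \sum_T A_i^T y^T)/A_i^S$ over the remaining set $S$ is removed and its index is the running sum of the dual values $y^S$; the function name `adaptive_greedy`, the callable `A(i, S)` and the small data set are hypothetical and chosen only so that the independence condition (17) holds.

```python
from fractions import Fraction


def adaptive_greedy(R, A, E):
    """Indices {i: gamma_i} from an adaptive greedy pass over the set E.

    R : dict element -> reward; A : callable (i, S) -> A_i^S (assumed layout).
    """
    remaining = set(E)
    fixed = []           # pairs (S, y^S) already determined
    gamma = {}
    level = Fraction(0)  # running sum of the dual variables
    while remaining:
        S = frozenset(remaining)

        def ratio(i):
            spent = sum(A(i, T) * y_T for T, y_T in fixed)
            return (Fraction(R[i]) - spent) / A(i, S)

        best = max(sorted(remaining), key=ratio)
        y_S = ratio(best)
        fixed.append((S, y_S))
        level += y_S
        gamma[best] = level
        remaining.remove(best)
    return gamma


# Two blocks E_1 = {0, 1} and E_2 = {2}; A_i^S depends only on S within i's block,
# which is the independence condition (17).
BLOCK = {0: frozenset({0, 1}), 1: frozenset({0, 1}), 2: frozenset({2})}
TABLE = {
    (0, frozenset({0})): 1, (0, frozenset({0, 1})): 2,
    (1, frozenset({1})): 1, (1, frozenset({0, 1})): 1,
    (2, frozenset({2})): 1,
}


def A(i, S):
    return TABLE[(i, S & BLOCK[i])]


R = {0: 3, 1: 1, 2: 2}
full = adaptive_greedy(R, A, {0, 1, 2})
by_blocks = {**adaptive_greedy(R, A, {0, 1}), **adaptive_greedy(R, A, {2})}
print(full == by_blocks)
```

In this example the indices computed on the full problem coincide with those computed block by block, as (18) predicts; exact rational arithmetic is used so that the comparison is not affected by rounding.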

It is important to emphasize that the index decomposition property is much stronger than the indexability property. We will see later that the classical multi-armed bandit problem has the index decomposition property. On the other hand, we will see that Klimov's problem (see [22]) has the indexability property, but in the general case it is not decomposable.

2.3     Generalized  Conservation  Laws 

Shantikumar and Yao [26] formalized a definition of strong conservation laws for performance measures in general multiclass queues, that implies a polymatroidal structure in the performance space. We next present a more general definition of generalized conservation laws in a broader context that implies an extended polymatroidal structure in the performance space, which has several interesting and important implications. Consider a general dynamic and stochastic job scheduling process. There are $n$ job types, which we label $i \in E = \{1, \ldots, n\}$. We consider the class of admissible scheduling policies, which we denote $\mathcal{U}$, to be the class of all nonidling, nonpreemptive and nonanticipative scheduling policies.

Let $x_i^u$ be a performance measure of type $i$ jobs under admissible policy $u$, for $i \in E$. We assume that $x_i^u$ is an expectation. Let $x^u$ be the corresponding performance vector. Let $x^\pi$ denote the performance vector under a fixed priority rule that assigns priorities to the job types according to the permutation $\pi = (\pi_1, \ldots, \pi_n)$ of $E$, where type $\pi_n$ has the highest priority, ..., type $\pi_1$ has the lowest priority.

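Definition 5 (Generalized Conservation Laws) The performance vector $x$ is said to satisfy generalized conservation laws if there exist a function $b : 2^E \to \Re_+$ such that $b(\emptyset) = 0$ and a matrix $A = (A_i^S)_{i \in E, S \subseteq E}$ satisfying (1) such that:

(a)

$$b(S) = \sum_{i \in S} A_i^S x_i^\pi, \quad \text{for all } \pi : \{\pi_1, \ldots, \pi_{|S|}\} = S \text{ and } S \subseteq E; \qquad (26)$$

(b)

$$\sum_{i \in S} A_i^S x_i^u \ge b(S), \quad \text{for all } S \subset E, \quad \text{and} \quad \sum_{i \in E} A_i^E x_i^u = b(E), \quad \text{for all } u \in \mathcal{U}. \qquad (27)$$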

In words, a performance vector is said to satisfy generalized conservation laws if there exist weights $A_i^S$ such that the total weighted performance over all job types is invariant under any admissible policy, and the minimum weighted performance over the job types in any subset $S \subset E$ is achieved by any fixed priority rule that gives priority to all other types (in $S^c$) over types in $S$. The strong conservation laws of Shantikumar and Yao [26] correspond to the special case that all weights are $A_i^S = 1$.

The  connection  between  generalized  conservation  laws  and  extended  polymatroids  is  the 
following  theorem: 

Theorem 4 Assume that the performance vector $x$ satisfies generalized conservation laws (26) and (27). Then

(a) The vertices of $B(A,b)$ are the performance vectors of the fixed priority rules, and $x^\pi = v(\pi)$, for every permutation $\pi$ of $E$.

(b) The extended polymatroid base $B(A,b)$ is the performance space.

Proof

(a) By (26) it follows that $x^\pi = v(\pi)$. And by Theorem 1 the result follows.

(b) Let $X = \{x^u : u \in \mathcal{U}\}$ be the performance space. Let $B_v(A,b)$ be the set of extreme points of $B(A,b)$. By (27) it follows that $X \subseteq B(A,b)$. By (a), $B_v(A,b) \subseteq X$. Hence, since $X$ is a convex set ($\mathcal{U}$ contains randomized policies) we have

$$B(A,b) = \mathrm{conv}(B_v(A,b)) \subseteq X.$$

Hence $X = B(A,b)$, and this completes the proof of the theorem. □

As a consequence of Theorem 4, it follows by Carathéodory's theorem that the performance vector $x^u$ corresponding to an admissible policy $u$ can be achieved by a randomization of at most $n + 1$ fixed priority rules.
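Since the fixed priority rules are the vertices of the performance space, their performance vectors can be computed directly from the conservation law (26) applied to the nested sets $\{\pi_1\}, \{\pi_1, \pi_2\}, \ldots, E$, which gives a triangular system. The following is a minimal sketch under the assumption that $A$ and $b$ are supplied as hypothetical callables `A(i, S)` and `b(S)`; the toy data (unit weights and $b(S) = |S|^2$) are chosen only for illustration.

```python
from fractions import Fraction


def priority_rule_performance(A, b, pi):
    """Performance vector of the fixed priority rule pi = (pi_1, ..., pi_n).

    Applies (26) to S_j = {pi_1, ..., pi_j} for j = 1, ..., n and solves the
    resulting lower-triangular system by forward substitution.
    """
    x = {}
    for j in range(len(pi)):
        S = frozenset(pi[: j + 1])
        known = sum(A(pi[l], S) * x[pi[l]] for l in range(j))
        x[pi[j]] = (Fraction(b(S)) - known) / A(pi[j], S)
    return x


# Toy data: unit weights and b(S) = |S|^2.
x = priority_rule_performance(lambda i, S: 1, lambda S: len(S) ** 2, (2, 0, 1))
print(x)
print(sum(x.values()))  # equals b(E)
```

Each vertex costs $O(n^2)$ evaluations of $A$, and the last equation of the system reproduces the equality constraint over $E$ in (27).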

2.4     Optimization  over  systems  satisfying  generalized  conservation  laws 

Let $x^u$ be a performance vector for a dynamic and stochastic job scheduling process that satisfies generalized conservation laws (associated with $A$, $b(\cdot)$). Suppose that we want to find an admissible policy $u$ that maximizes a linear reward function $\sum_{i \in E} R_i x_i^u$. This optimal scheduling control problem can be expressed as

$$(P_{\mathcal{U}}) \quad \max\Bigl\{ \sum_{i \in E} R_i x_i^u : u \in \mathcal{U} \Bigr\}. \qquad (28)$$

By Theorem 4 this control problem can be transformed into the following linear programming problem:

$$(P) \quad \max\Bigl\{ \sum_{i \in E} R_i x_i : x \in B(A,b) \Bigr\}. \qquad (29)$$

The strong structural properties of extended polymatroids lead to strong structural properties in the control problem. Suppose that to each job type $i$ we attach an index, $\gamma_i$. A policy that selects at each decision epoch a job of currently largest index will be referred to as an index policy.

Let $\gamma_1, \ldots, \gamma_n$ be the generalized Gittins indices of linear program (P). As a direct consequence of the results of Section 2.2 we show next that the control problem $(P_{\mathcal{U}})$ is solved by an index policy, with indices given by $\gamma_1, \ldots, \gamma_n$.

Theorem 5 (Indexability) (a) Let $v(\pi)$ be an optimal solution of linear program (P). Then the fixed priority rule that assigns priorities to the job types according to permutation $\pi$ is optimal for the control problem $(P_{\mathcal{U}})$;

(b) A policy that selects at each decision epoch a job of currently largest generalized Gittins index is optimal for the control problem.
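Operationally, an index policy needs only the vector of indices and the set of job types currently present. The following minimal sketch shows the decision taken at one epoch; the function name `select_job` and the dictionary representation of the system state are assumptions made for the example.

```python
def select_job(present, gamma):
    """Return the job type to engage at a decision epoch, or None if idle.

    present : dict type -> number of jobs of that type currently in the system
    gamma   : dict type -> index of that type
    """
    candidates = [i for i, count in present.items() if count > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda i: gamma[i])


gamma = {0: 1.5, 1: 1.0, 2: 2.0}
print(select_job({0: 1, 1: 2, 2: 0}, gamma))  # type 2 is absent, so type 0 is chosen
```

Because the indices do not depend on the state, the policy amounts to a fixed priority ordering of the job types, which is the content of part (a) of the theorem.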

The  previous  theorem  implies  that  systems  satisfying  generalized  conservation  laws  are 
indexable  systems. 

Let us consider now a dynamic and stochastic project selection process, in which there are $K$ project types, labeled $k = 1, \ldots, K$. At each decision epoch a project must be selected. A project of type $k$ can be in one of a finite number of states $i_k \in E_k$. These states correspond to stages in the development of the project. Clearly this process can be interpreted as a job scheduling process, as follows: simply interpret the action of selecting a project $k$ in state $i_k \in E_k$ as selecting a job of type $i = i_k \in \bigcup_{k=1}^{K} E_k$. We may interpret that each project consists of several jobs. Let us assume that this job scheduling process satisfies generalized conservation laws associated with matrix $A$ and set function $b(\cdot)$. By Theorem 5, the corresponding optimal control problem is solved by an index policy. We will see next that when a certain independence condition among the projects is satisfied, a strong index decomposition property holds.

We thus assume that $E$ is partitioned as $E = \bigcup_{k=1}^{K} E_k$. Let $x^k = (x_i^k)_{i \in E_k}$ be the performance vector over job types in $E_k$ corresponding to the project selection problem obtained when projects of types other than $k$ are ignored (i.e., they are never engaged). Let us assume that the performance vector $x^k$ satisfies generalized conservation laws associated with matrix $A^k$ and set function $b^k(\cdot)$, and that the independence condition (17) is satisfied. Let $\mathcal{U}_k$ be the corresponding set of admissible policies.

Under these assumptions, Theorem 3 applies, and together with Theorem 5(b) we get the following result:

Theorem 6 (Index Decomposition) Under condition (17), the generalized Gittins indices of job types in $E_k$ only depend on characteristics of project type $k$.

The previous theorem identifies a sufficient condition for the indices of an indexable system to have a strong decomposition property. Therefore, systems that satisfy generalized conservation laws which further satisfy (17) are decomposable systems. For such systems the solution of problem $(P_{\mathcal{U}})$ can be obtained by solving $K$ smaller independent subproblems. This theorem justifies the term generalized Gittins indices. We will see in Section 4 that when applied to the multi-armed bandit problem, these indices reduce to the usual Gittins indices.

Let us consider briefly the problem of optimizing a nonlinear cost function on the performance vector. Bhattacharya et al. [4] addressed the problems of separable convex, min-max, lexicographic and semi-separable convex optimization over an extended polymatroid, and provided iterative algorithms for their solution. As in the linear reward case, the control problem in the case of a nonlinear reward function can be reduced to solving a nonlinear programming problem over the base of an extended polymatroid.

3     Branching  Bandit  Processes 

Consider the following branching bandit process introduced by Weiss [35], who observed that it can model a large number of dynamic and stochastic scheduling processes. There is a finite number of project types, labeled $k = 1, \ldots, K$. A type $k$ project can be in one of a finite number of states $i_k \in E_k$, which correspond to stages in the development of the project. It is convenient in what follows to combine these two indicators into a single label $i = i_k$, the state of a project. Let $E = \bigcup_{k=1}^{K} E_k = \{1, \ldots, n\}$ be the finite set of possible states of all project types.

We associate with state $i$ of a project a random time $v_i$ and random arrivals $N_i = (N_{ij})_{j \in E}$. Engaging the project keeps the system busy for a duration $v_i$ (the duration of stage $i$), and upon completion of the stage the project is replaced by a nonnegative integer number of new projects $N_{ij}$, in states $j \in E$. We assume that given $i$, the durations and the descendants $(v_i, N_i)$ are random variables with an arbitrary joint distribution, independent of all other projects, and identically distributed for the same $i$. Projects are to be selected under a nonidling, nonpreemptive and nonanticipative scheduling policy $u$. We shall refer to this class of policies, which we denote $\mathcal{U}$, as the class of admissible policies. The decision epochs are $t = 0$ and the instants at which a project stage is completed and there is some project present. If $m_i$ is the number of projects in state $i$ present at a given time, then it is clear that this process is a semi-Markov decision process with states $m = (m_1, \ldots, m_n)$.

The model of arm-acquiring bandits (see Whittle [37], [38]) is a special case of branching bandit process, in which the descendants $N_i$ consist of two parts: (1) a transition of the project engaged to a new state, and (2) external arrivals of new projects, independent of $i$ or of the transition. The classical multi-armed bandit problem corresponds to the special case that there are no external arrivals of projects, and the stage durations are 1.

The branching bandit process is thus a special case of a project selection process. Therefore, as described in Subsection 2.4, it can be interpreted as a job scheduling process. Engaging a type $i$ job in the job scheduling model corresponds to selecting a project of state $i$ in the branching bandit model. We may interpret that each project consists of several jobs. In the analysis that follows, we shall refer to a project in state $i$ as a type $i$ job. In this section, we will define two different performance measures for a branching bandit process. The first one will be appropriate for modelling a discounted reward-tax structure. The second one will allow us to model an undiscounted tax structure. In each case we will show that they satisfy generalized conservation laws, and that the corresponding optimal control problem can be solved by a direct application of the results of Section 2.

Let $S \subseteq E$ be a subset of job types. We shall refer to jobs with types in $S$ as $S$-jobs. Assume now that at time $t = 0$ there is only a single job in the system, which is of type $i$. Consider the sequence of successive job selections corresponding to an admissible policy $u$ that gives complete priority to $S$-jobs. This sequence proceeds until all $S$-jobs are exhausted for the first time, or indefinitely. Call this an $(i,S)$ period. Let $T_i^S$ be the duration (possibly infinite) of an $(i,S)$ period. It is easy to see that the distribution of $T_i^S$ is independent of the admissible policy used, as long as it gives complete priority to $S$-jobs. Note that an $(i,\emptyset)$ period is distributed as $v_i$. It will be convenient to introduce the following additional notation:

$v_{i,k}$ = duration of the $k$th selection of a type $i$ job; notice that the distribution of $v_{i,k}$ is independent of $k$ ($v_i$).

$\tau_{i,k}$ = time at which the $k$th selection of a type $i$ job occurs.

$\nu_i$ = number of times a type $i$ job is selected (can be infinity).

$\{T_{i,k}^S\}_{k \ge 1}$ = duration of the $(i,S)$ period that starts with the $k$th selection of a type $i$ job.

$Q_i(t)$ = number of type $i$ jobs in the system at time $t$. $Q(t)$ denotes the vector of the $Q_i(t)$'s. We assume $Q(0) = (m_1, \ldots, m_n)$ is known.

$T_m^S$ = time until all $S$-jobs are exhausted for the first time (can be infinity); note that $T_m^E$ is the duration of the busy period.

$I_i(t)$ = 1, if a type $i$ job is being engaged at time $t$; 0, otherwise.

$\Delta_{i,k}^S$ = $\inf\{\Delta \ge v_{i,k} : \sum_{j \in S} I_j(\tau_{i,k} + \Delta) = 1\}$, for $i \in S$; note that $\Delta_{i,k}^S$ is the interval between the $k$th selection of a type $i$ job and the next selection, if any, of an $S$-job. If no more jobs in $S$ are selected, then $\Delta_{i,k}^S = T_m^E - \tau_{i,k}$, the remaining interval of the busy period.

$\Delta^S$ = $\inf\{t : \sum_{i \in S} I_i(t) = 1\}$; note that $\Delta^S$ is the interval until the first job in $S$, if any, is selected. If no job in $S$ is selected, $\Delta^S = T_m^E$, the busy period.

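Proposition 4 Assume that jobs are selected in the branching bandit process under an admissible policy. Then, for every $S \subseteq E$:

(a) If the policy gives complete priority to $S^c$-jobs then the busy period $[0, T_m^E)$ can be partitioned as follows:

$$[0, T_m^E) = [0, T_m^{S^c}) \cup \bigcup_{i \in S} \bigcup_{k=1}^{\nu_i} [\tau_{i,k}, \tau_{i,k} + T_{i,k}^{S^c}) \quad \text{w. p. 1.} \qquad (30)$$

(b) The busy period $[0, T_m^E)$ can be partitioned as follows:

$$[0, T_m^E) = [0, \Delta^S) \cup \bigcup_{i \in S} \bigcup_{k=1}^{\nu_i} [\tau_{i,k}, \tau_{i,k} + \Delta_{i,k}^S) \quad \text{w. p. 1.} \qquad (31)$$

(c) The following inequalities hold w. p. 1:

$$\Delta_{i,k}^S \le T_{i,k}^{S^c}, \qquad (32)$$

and

$$\Delta^S \le T_m^{S^c}. \qquad (33)$$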

Proof 

(a) Intuitively (30) expresses the fact that under a policy that gives complete priority to $S^c$-jobs, the duration of a busy period is partitioned into (1) the initial interval in which all jobs in $S^c$ are exhausted for the first time, and (2) intervals in which all jobs in $S$ are exhausted, given that after working on a job in $S$ we clear first all jobs in $S^c$ that were generated.

More formally, let $u$ be an admissible policy that gives complete priority to $S^c$-jobs. It is easy to see then that the intervals in the right hand side of (30) are disjoint. Moreover, the inclusion $\supseteq$ is obvious. In order to show that (30) is indeed a partition, let us show the inclusion $\subseteq$. Let $t \in [0, T_m^E) \setminus [0, T_m^{S^c})$, otherwise we are done. Since $u$ is a nonidling policy, at time $t$ some job is being engaged. Let $j$ be the type of this job. If $j \in S$ then it is clear that $t \in [\tau_{j,k}, \tau_{j,k} + T_{j,k}^{S^c})$ for some $k$, and we are done. Let us assume that $j \in S^c$. Let us define

$$D = \{\tau_{i,k} : i \in S,\ k \in \{1, \ldots, \nu_i\}, \text{ and } \tau_{i,k} < t\}.$$

Since $t > T_m^{S^c} \in D$ it follows that $D \ne \emptyset$. Now, since by hypothesis $E[v_i] > 0$, for all $i$, it follows that $D$ is a finite set. Let $i^* \in S$ and $k^*$ be such that

$$\tau_{i^*,k^*} = \max_{\tau \in D} \tau.$$

Assume that $\tau_{i^*,k^*} + T_{i^*,k^*}^{S^c} \le t$. Now, $\tau_{i^*,k^*} + T_{i^*,k^*}^{S^c}$ is a decision epoch at which $S^c$ is empty. Since the policy is nonidling, it follows that at this epoch one starts working on some type $i$ job, with $i \in S$, that is, $\tau_{i,k} = \tau_{i^*,k^*} + T_{i^*,k^*}^{S^c}$, contradicting the definition of $\tau_{i^*,k^*}$. Hence, it must be $t < \tau_{i^*,k^*} + T_{i^*,k^*}^{S^c}$. And by definition of $D$ it follows that $t \in [\tau_{i^*,k^*}, \tau_{i^*,k^*} + T_{i^*,k^*}^{S^c})$, and this completes the proof of part (a).

(b) Equality (31) formalizes the fact that under an admissible policy the busy period can be decomposed into (1) the interval until the first job in $S$ is selected, (2) the disjoint union of the intervals between selections of successive $S$-jobs and (3) the interval between the last selection of a job in $S$ and the end of the busy period. Note that if no $S$-job is selected, then $\nu_i = 0$, for $i \in S$, and $\Delta^S = T_m^E$, thus reducing the partition to a single interval.

(c) Let $\tau_{i,k}$ be the time of the $k$th selection of a type $i$ job ($i \in S$). Since the next selection (if any) of an $S$-job can occur, at most, at the end of the $(i,S^c)$ period $[\tau_{i,k}, \tau_{i,k} + T_{i,k}^{S^c})$, inequality (32) follows. On the other hand, since the time until the first selection of an $S$-job, $\Delta^S$, can be at most the duration of the initial $(i,S^c)$ period, (33) follows. □


3.1     Discounted  Branching  Bandits 

In this subsection we will introduce a family of performance measures for branching bandits, $\{x^u(\alpha)\}_{\alpha > 0}$, that satisfy generalized conservation laws. They are appropriate for modelling a linear discounted reward-tax structure on the branching bandit process. We have already defined the indicator

$$I_i(t) = \begin{cases} 1, & \text{if a type } i \text{ job is being engaged at time } t; \\ 0, & \text{otherwise,} \end{cases} \qquad (34)$$

and, for a given $\alpha > 0$, we define

$$x_i^u(\alpha) = \mathrm{E}_u\Bigl[\int_0^\infty e^{-\alpha t} I_i(t)\,dt\Bigr] = \int_0^\infty \mathrm{E}_u[I_i(t)]\, e^{-\alpha t}\,dt, \quad i \in E. \qquad (35)$$
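To make the discounted performance measure concrete, the quantity $\int_0^\infty e^{-\alpha t} I_i(t)\,dt$ can be accumulated along one sample path of a branching bandit run under a fixed priority rule; averaging over many paths would then estimate $x_i^u(\alpha)$. The following is a minimal sketch: the function name, the callables `duration` and `offspring`, and the step cap `max_steps` are assumptions of the example, not part of the model.

```python
import math


def discounted_performance(initial, duration, offspring, priority, alpha,
                           max_steps=10_000):
    """One-sample-path value of the discounted busy time of each job type.

    initial   : dict type -> number of jobs present at time 0
    duration  : callable type -> sampled stage duration v_i
    offspring : callable type -> dict of descendants {type: count}
    priority  : list of types, highest priority first (fixed priority rule)
    alpha     : discount rate, alpha > 0
    """
    counts = dict(initial)
    x = {i: 0.0 for i in priority}
    t = 0.0
    for _ in range(max_steps):
        # Nonidling, nonpreemptive rule: engage the best type that is present.
        job = next((i for i in priority if counts.get(i, 0) > 0), None)
        if job is None:
            break  # busy period is over
        v = duration(job)
        # Integral of exp(-alpha * s) over the service interval [t, t + v).
        x[job] += (math.exp(-alpha * t) - math.exp(-alpha * (t + v))) / alpha
        t += v
        counts[job] -= 1
        for child, k in offspring(job).items():
            counts[child] = counts.get(child, 0) + k
    return x, t


# Deterministic toy instance: a type 0 job of length 1 spawns one type 1 job of length 2.
x, busy = discounted_performance(
    {0: 1},
    lambda i: {0: 1.0, 1: 2.0}[i],
    lambda i: {0: {1: 1}, 1: {}}[i],
    [0, 1],
    0.5,
)
print(x, busy)
```

In this instance the two components add up to $\int_0^3 e^{-\alpha t}\,dt$, the discounted length of the busy period, whatever the priority order, which is the kind of invariance that the conservation laws of this section express.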


3.1.1      Generalized  Conservation  Laws 


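In this section we prove that the performance measure for branching bandits defined in (35) satisfies generalized conservation laws. Let us define

$$A_{i,\alpha}^S = \frac{\mathrm{E}\Bigl[\int_0^{T_i^{S^c}} e^{-\alpha t}\,dt\Bigr]}{\mathrm{E}\Bigl[\int_0^{v_i} e^{-\alpha t}\,dt\Bigr]}, \quad i \in S, \qquad (36)$$

and

$$b_\alpha(S) = \mathrm{E}\Bigl[\int_0^{T_m^E} e^{-\alpha t}\,dt\Bigr] - \mathrm{E}\Bigl[\int_0^{T_m^{S^c}} e^{-\alpha t}\,dt\Bigr]. \qquad (37)$$

The main result is the following

Theorem 7 (Generalized Conservation Laws for Discounted Branching Bandits) The performance vector for branching bandits $x^u(\alpha)$ satisfies generalized conservation laws (26) and (27) associated with matrix $A_\alpha$ and set function $b_\alpha(\cdot)$.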

Proof 

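Let $S \subseteq E$. Let us assume that jobs are selected under an admissible policy $u$. This generates a branching bandit process. Let us define two random vectors, $(r_i^u)_{i \in E}$ and $(r_i^{u,S})_{i \in S}$, as functions of its sample path as follows:

$$r_i^u = \int_0^\infty I_i(t)\, e^{-\alpha t}\,dt = \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + v_{i,k}} e^{-\alpha t}\,dt = \sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{v_{i,k}} e^{-\alpha t}\,dt, \quad i \in E, \qquad (38)$$

and

$$r_i^{u,S} = \sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{T_{i,k}^{S^c}} e^{-\alpha t}\,dt, \quad i \in S. \qquad (39)$$

Now, we have

$$\begin{aligned} x_i^u(\alpha) = \mathrm{E}_u[r_i^u] &= \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{v_{i,k}} e^{-\alpha t}\,dt\Bigr] \\ &= \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} \mathrm{E}\bigl[e^{-\alpha \tau_{i,k}} \mid \nu_i\bigr]\, \mathrm{E}\Bigl[\int_0^{v_{i,k}} e^{-\alpha t}\,dt\Bigr]\Bigr] \qquad (40) \\ &= \mathrm{E}\Bigl[\int_0^{v_i} e^{-\alpha t}\,dt\Bigr]\, \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}}\Bigr]. \qquad (41) \end{aligned}$$

Note that equality (40) holds because, since $u$ is nonanticipative, $\tau_{i,k}$ and $v_{i,k}$ are independent random variables. On the other hand, we have

$$\begin{aligned} \mathrm{E}_u[r_i^{u,S}] &= \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{T_{i,k}^{S^c}} e^{-\alpha t}\,dt\Bigr] \\ &= \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} \mathrm{E}\bigl[e^{-\alpha \tau_{i,k}} \mid \nu_i\bigr]\, \mathrm{E}\Bigl[\int_0^{T_{i,k}^{S^c}} e^{-\alpha t}\,dt\Bigr]\Bigr] \qquad (42) \\ &= \mathrm{E}\Bigl[\int_0^{T_i^{S^c}} e^{-\alpha t}\,dt\Bigr]\, \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}}\Bigr]. \qquad (43) \end{aligned}$$

Note that equality (42) holds because, since $u$ is nonanticipative, $\tau_{i,k}$ and $T_{i,k}^{S^c}$ are independent. Hence, by (41) and (43)

$$\mathrm{E}_u[r_i^{u,S}] = A_{i,\alpha}^S\, x_i^u(\alpha), \quad i \in S, \qquad (44)$$

and we obtain:

$$\mathrm{E}_u\Bigl[\sum_{i \in S} r_i^{u,S}\Bigr] = \sum_{i \in S} A_{i,\alpha}^S\, x_i^u(\alpha). \qquad (45)$$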


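We first show that generalized conservation law (26) holds. Consider a policy $\pi$ that gives complete priority to $S^c$-jobs. Applying Proposition 4 (part (a)), we obtain:

$$\begin{aligned} \int_0^{T_m^E} e^{-\alpha t}\,dt &= \int_0^{T_m^{S^c}} e^{-\alpha t}\,dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} e^{-\alpha t}\,dt \\ &= \int_0^{T_m^{S^c}} e^{-\alpha t}\,dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{T_{i,k}^{S^c}} e^{-\alpha t}\,dt \\ &= \int_0^{T_m^{S^c}} e^{-\alpha t}\,dt + \sum_{i \in S} r_i^{\pi,S}. \qquad (46) \end{aligned}$$

Hence, taking expectations and using equation (45) we obtain

$$\mathrm{E}\Bigl[\int_0^{T_m^E} e^{-\alpha t}\,dt\Bigr] = \mathrm{E}\Bigl[\int_0^{T_m^{S^c}} e^{-\alpha t}\,dt\Bigr] + \sum_{i \in S} A_{i,\alpha}^S\, x_i^\pi(\alpha),$$

or equivalently, by (37),

$$\sum_{i \in S} A_{i,\alpha}^S\, x_i^\pi(\alpha) = b_\alpha(S),$$

which proves that generalized conservation law (26) holds.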

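We next show that generalized conservation law (27) is satisfied. Since jobs are selected under admissible policy $u$, Proposition 4 (part (b)) applies, and we can write

$$\int_0^{T_m^E} e^{-\alpha t}\,dt = \int_0^{\Delta^S} e^{-\alpha t}\,dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + \Delta_{i,k}^S} e^{-\alpha t}\,dt. \qquad (47)$$

On the other hand, we have

$$\begin{aligned} \sum_{i \in S} r_i^{u,S} &\ge \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + \Delta_{i,k}^S} e^{-\alpha t}\,dt \qquad (48) \\ &= \int_0^{T_m^E} e^{-\alpha t}\,dt - \int_0^{\Delta^S} e^{-\alpha t}\,dt \qquad (49) \\ &\ge \int_0^{T_m^E} e^{-\alpha t}\,dt - \int_0^{T_m^{S^c}} e^{-\alpha t}\,dt \qquad (50) \\ &= \int_{T_m^{S^c}}^{T_m^E} e^{-\alpha t}\,dt. \qquad (51) \end{aligned}$$

Notice that (48) follows by Proposition 4 (part (c)), (49) follows by (47), and (50) by Proposition 4 (part (c)). Hence, taking expectations in (51), and applying (45) we obtain

$$\sum_{i \in S} A_{i,\alpha}^S\, x_i^u(\alpha) \ge \mathrm{E}\Bigl[\int_0^{T_m^E} e^{-\alpha t}\,dt\Bigr] - \mathrm{E}\Bigl[\int_0^{T_m^{S^c}} e^{-\alpha t}\,dt\Bigr] = b_\alpha(S), \qquad (52)$$

which proves that generalized conservation law (27) holds, and this completes the proof of the theorem. □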

Hence, by the results of Subsection 2.3 we obtain:

Corollary 2 The performance space for branching bandits corresponding to the performance vector $x^u(\alpha)$ is the extended polymatroid base $B(A_\alpha, b_\alpha)$; furthermore, the vertices of $B(A_\alpha, b_\alpha)$ are the performance vectors corresponding to the fixed priority rules.

3.1.2 The Discounted Reward-Tax Problem

Let us associate with a branching bandit process the following linear reward-tax structure: An instantaneous reward of $R_i$ is received at the completion epoch of a type $i$ job. In addition, a holding tax $C_i$ is incurred continuously during the interval that a type $i$ job is in the system. Rewards and taxes are discounted in time with a discount factor $\alpha > 0$. Let us denote

$V_{u,\alpha}^{(R,C)}(m)$ = expected total present value of rewards received minus taxes incurred under policy $u$, given that there are initially $m_i$ jobs of type $i$ in the system, for $i \in E$.

The discounted reward-tax problem is the following optimal control problem: find an admissible policy $u^*$ that maximizes $V_{u,\alpha}^{(R,C)}(m)$ over all admissible policies $u$. In this section we reduce the reward-tax problem to the pure rewards case (where $C = 0$). We also find a closed formula for $V_{u,\alpha}^{(R,C)}(m)$ and show how to solve the problem using algorithm A1.


The  Pure  Rewards  Case. 

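Let us introduce the transform of $v_i$, i.e., $\Psi_i(\theta) = \mathrm{E}[e^{-\theta v_i}]$. We then have

$$\begin{aligned} V_{u,\alpha}^{(R,0)}(m) &= \mathrm{E}_u\Bigl[\sum_{i \in E} \sum_{k=1}^{\nu_i} R_i\, e^{-\alpha(\tau_{i,k} + v_{i,k})}\Bigr] \qquad (53) \\ &= \sum_{i \in E} R_i\, \mathrm{E}[e^{-\alpha v_i}]\, \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}}\Bigr] \qquad (54) \\ &= \sum_{i \in E} R_i\, \frac{\alpha\, \Psi_i(\alpha)}{1 - \Psi_i(\alpha)}\, x_i^u(\alpha). \qquad (55) \end{aligned}$$

Notice that equality (54) holds by (41).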
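The passage from the discounted completion rewards to the performance vector can be spelled out as a short worked computation. The following sketch assumes only the definition of the transform $\Psi_i$, the factorization (41), and $\Psi_i(\alpha) < 1$ (which holds for $\alpha > 0$ when $v_i$ is positive with positive probability); the common factor $\mathrm{E}_u[\sum_k e^{-\alpha \tau_{i,k}}]$ is eliminated between the last two lines.

```latex
\begin{align*}
\mathrm{E}\Bigl[\int_0^{v_i} e^{-\alpha t}\,dt\Bigr]
  &= \mathrm{E}\Bigl[\frac{1 - e^{-\alpha v_i}}{\alpha}\Bigr]
   = \frac{1 - \Psi_i(\alpha)}{\alpha}, \\
x_i^u(\alpha)
  &= \frac{1 - \Psi_i(\alpha)}{\alpha}\,
     \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}}\Bigr]
     \qquad \text{(by (41))}, \\
\mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} R_i\, e^{-\alpha(\tau_{i,k} + v_{i,k})}\Bigr]
  &= R_i\, \Psi_i(\alpha)\,
     \mathrm{E}_u\Bigl[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}}\Bigr]
   = R_i\, \frac{\alpha\, \Psi_i(\alpha)}{1 - \Psi_i(\alpha)}\, x_i^u(\alpha).
\end{align*}
```

Summing the last line over $i \in E$ expresses the pure rewards objective as a linear function of the performance vector, which is what allows the linear programming results of Section 2 to be applied.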

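It is also straightforward to model the case in which rewards are received continuously during the interval that a type $i$ job is in the system rather than at a completion epoch. Let $V_{u,\alpha}^{(R,0)}(m)$ now denote the expected total present value of rewards in this case. Then

$$V_{u,\alpha}^{(R,0)}(m) = \mathrm{E}_u\Bigl[\int_0^\infty \sum_{i \in E} R_i\, e^{-\alpha t} I_i(t)\,dt\Bigr] = \sum_{i \in E} R_i\, x_i^u(\alpha).$$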


The Reward-Tax Problem: Reduction to the Pure Rewards Case

We will next show how to reduce the reward-tax problem to the pure rewards case using the following idea introduced by Bell [1] (see also Harrison [18], Stidham [2!^] and Whittle [38] for further discussion). The expected present value of holding taxes is the same whether they are charged continuously in time, or according to the following charging scheme. At the arrival epoch of a type $i$ job, charge the system with an instantaneous entrance charge of $(C_i/\alpha)$, equal to the total discounted continuous holding cost that would be incurred if the job remained within the system forever; at the departure epoch of the job (if it ever departs), credit the system with an instantaneous departure refund of $(C_i/\alpha)$, thus refunding that portion of the entrance cost corresponding to residence beyond the departure epoch. Therefore, we can write

$$\begin{aligned} V_{u,\alpha}^{(R,C)}(m) &= \mathrm{E}_u[\text{Rewards}] - \mathrm{E}_u[\text{Charges at } t = 0] + \bigl(\mathrm{E}_u[\text{Departure refunds}] - \mathrm{E}_u[\text{Entrance charges}]\bigr) \\ &= V_{u,\alpha}^{(R + \tilde R, 0)}(m) - \sum_{i \in E} m_i\, (C_i/\alpha), \qquad (56) \end{aligned}$$

where

$$\tilde R_i = (C_i/\alpha) - \sum_{j \in E} \mathrm{E}[N_{ij}]\, (C_j/\alpha). \qquad (57)$$
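The equivalence between continuous holding taxes and the entrance charge and departure refund scheme can be checked numerically for a single job. The sketch below is an illustration only: the function names are hypothetical, the continuous charge is approximated by a midpoint rule, and the arrival and departure epochs are arbitrary test values.

```python
import math


def holding_cost_numeric(C, alpha, arrival, departure, steps=100_000):
    """Midpoint-rule value of the holding tax C * exp(-alpha * t) over [arrival, departure)."""
    h = (departure - arrival) / steps
    return sum(
        C * math.exp(-alpha * (arrival + (k + 0.5) * h)) * h for k in range(steps)
    )


def charge_and_refund(C, alpha, arrival, departure):
    """Entrance charge C/alpha at arrival minus refund C/alpha at departure, both discounted."""
    entrance = (C / alpha) * math.exp(-alpha * arrival)
    # A job that never departs (departure = inf) receives no refund.
    refund = 0.0 if math.isinf(departure) else (C / alpha) * math.exp(-alpha * departure)
    return entrance - refund


print(holding_cost_numeric(2.0, 0.1, 1.0, 4.0))
print(charge_and_refund(2.0, 0.1, 1.0, 4.0))
```

The two printed values agree up to the discretization error of the midpoint rule, which is the sample-path identity underlying the reduction to the pure rewards case.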

From equation (56) it is straightforward to apply the results of Section 2 to solve the control problem: use algorithm A1 with input $(R_\alpha, A_\alpha)$, where

$$R_{i,\alpha} = \bigl(R_i + \tilde R_i\bigr)\, \frac{\alpha\, \Psi_i(\alpha)}{1 - \Psi_i(\alpha)}.$$

Let $\gamma_1(\alpha), \ldots, \gamma_n(\alpha)$ be the corresponding generalized Gittins indices. Then we have

Theorem 8 (Optimality and Indexability: Discounted Branching Bandits) (a) Algorithm A1 provides an optimal policy for the discounted reward-tax branching bandit problem;
(b) An optimal policy is to work at each decision epoch on a project with largest index $\gamma_i(\alpha)$.

The previous theorem characterizes the structure of the optimal policy. Moreover, since in Proposition 6 below, we find closed form expressions for the matrix $A_\alpha$ and the set function $b_\alpha(\cdot)$, we can compute not only the structure, but also the performance of the optimal policy (optimal profit, optimal extreme point of the extended polymatroid). Note also that the decomposition of the indices does not hold in the general case; in other words, the generalized indices of the states of a type $k$ project depend in general on characteristics of project types other than $k$, i.e., branching bandits is an example of an indexable but not decomposable system. We may also prove the following result:

Theorem 9 (Continuity of generalized Gittins indices) The generalized Gittins indices $\gamma_1(\alpha), \ldots, \gamma_n(\alpha)$ are continuous functions of the discount factor $\alpha$, for $\alpha > 0$.

Proof

It is easy to see that the generalized Gittins indices depend continuously on the input of algorithm A1. Also, since the function $\alpha \mapsto (R_\alpha, A_\alpha)$ is continuous the result follows. □

The Pure Tax Case: Minimizing Time-Dependent Expected Number in System

In several applications of branching bandits (for example queueing systems) one is often
interested in minimizing a weighted sum of discounted time-dependent expected number of
jobs in the system. Let $Q_j^u(\cdot)$ denote the Laplace transform of the time-dependent expected
number of type j jobs in the system under policy u, i.e.,

$$
Q_j^u(\theta) = \int_0^\infty E_u[Q_j(t) \mid Q(0) = m]\, e^{-\theta t}\, dt, \qquad j \in E. \tag{58}
$$

An  interesting  optimization  problem  is  to: 

The problem can be modelled as a pure tax problem as follows:

$$
\sum_{j \in E} C_j Q_j^u(\alpha) = -V_u^{(0,C)}(m),
$$

and thus by making $R = 0$, $C_j = 1$ and $C_i = 0$ for $i \neq j$ in (56) we obtain

.  €  f  (60) 

See  Harrison  [18]  for  a  similar  result  in  the  context  of  a  multiclass  queue. 


3.1.3      Interpretation of Generalized Gittins Indices in Discounted Branching Bandits

Consider the following modification of the branching bandits problem: We modify the
original problem by adding an additional project type, which we call 0, with only one
state/stage of infinite duration, that is, $v_0 = \infty$ with probability 1. A reward of $R_0$,
continuously discounted, is received for each unit of time that a type 0 project is engaged.
Notice that the choice of working on project type 0 can be interpreted as the choice of
retirement from the original problem for a pension of $R_0$, continuously discounted in time.
Now, the modified problem is still a branching bandits problem. Let us assume that at
time $t = 0$ there are only two projects present, one of type 0 and another in state $i \in E$.
We may then ask the following question: Which is the smallest value of the pension $R_0$
which makes the option of retirement (working on project type 0) preferable to the option
of continuation (working on the project in state i)? Let us call this equitable surrender
value $R_0(i)$. We have then the following result:

Proposition 5 The generalized Gittins index of project state i in the original branching
bandits problem coincides with the equitable surrender value of state i, $R_0(i)$.

Proof 

Let $\gamma_1, \ldots, \gamma_n$ be the generalized Gittins indices corresponding to the original branching
bandits problem. Let $\gamma_0^0, \gamma_1^0, \ldots, \gamma_n^0$ be the generalized Gittins indices for the modified
problem. Let us partition the modified state space as $\tilde{E} = \{0\} \cup E$. It is easy to verify that
the decomposition condition (17) holds. Hence Theorem 3 applies, and therefore we have

$$
\gamma_0^0 = R_0 \qquad \text{and} \qquad \gamma_i^0 = \gamma_i, \quad i \in E. \tag{61}
$$

Now, since by Theorem 8 it is optimal to work on a project with largest current generalized
Gittins index, it follows that the surrender reward $R_0$ which makes the options of continuation
and of retirement (with reward $R_0$) equally attractive is $R_0 = \gamma_i^0$. But by definition
$R_0(i)$ is such a breakpoint. Therefore $R_0(i) = \gamma_i$, and the proof is complete. □
Whittle [36], [38] introduced the idea of a retirement option in his analysis of the multi-armed
bandit problem, and provided an interpretation of the Gittins indices as equitable
surrender values. Weber [34] also makes use of this characterization of the Gittins indices in
his intuitive proof. Here we extend this interpretation to the more general case of branching
bandits. From this characterization it follows that the generalized Gittins indices coincide
indeed with the well known Gittins indices in the classical multi-armed bandit problem,
which justifies their name.
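
The equitable surrender value can be illustrated numerically in the classical discrete-time special case. The sketch below is not taken from the paper: it assumes a single Markov project with rewards `R`, transition matrix `P` and discount factor `beta`, replaces the continuously discounted pension by a per-period pension `rho` (so that retiring is worth `rho / (1 - beta)`), and finds by bisection the smallest `rho` for which retiring in state `i` is at least as good as continuing. All function names are hypothetical.

```python
def continuation_value(R, P, beta, M, i, tol=1e-13, max_iter=100000):
    """Value of continuing once from state i when retirement pays lump sum M.

    Solves V(j) = max(M, R[j] + beta * sum_l P[j][l] * V(l)) by value
    iteration, then returns R[i] + beta * sum_l P[i][l] * V(l).
    """
    n = len(R)
    V = [M] * n  # retiring immediately is always feasible, so V >= M
    for _ in range(max_iter):
        new_V = [max(M, R[j] + beta * sum(P[j][l] * V[l] for l in range(n)))
                 for j in range(n)]
        change = max(abs(a - b) for a, b in zip(new_V, V))
        V = new_V
        if change < tol:
            break
    return R[i] + beta * sum(P[i][l] * V[l] for l in range(n))


def surrender_value(R, P, beta, i, steps=60):
    """Smallest per-period pension rho making retirement optimal in state i."""
    lo, hi = min(R), max(R)  # the breakpoint lies between the extreme rewards
    for _ in range(steps):
        rho = 0.5 * (lo + hi)
        M = rho / (1.0 - beta)  # present value of the pension stream
        if M >= continuation_value(R, P, beta, M, i) - 1e-12:
            hi = rho  # retirement is already (weakly) preferred
        else:
            lo = rho
    return hi


# Two-state project: state 0 pays 1 forever; state 1 pays 0, then moves to state 0.
R = [1.0, 0.0]
P = [[1.0, 0.0], [1.0, 0.0]]
values = [surrender_value(R, P, 0.5, i) for i in range(2)]
print(values)
```

Bisection is valid here because the gap between the retirement value and the continuation value is nondecreasing in the pension; the tolerance constants are arbitrary choices for this illustration.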


3.1.4      Computation of $A_\alpha$ and $b_\alpha(\cdot)$

The results of the previous sections are structural, but do not lead to explicit computations
of the matrix $A_\alpha$ and the set function $b_\alpha(\cdot)$ appearing in the generalized conservation laws
(26) and (27) for the branching bandit problem. Our goal in this section is to compute from
generic data the matrix $A_\alpha$ and the set function $b_\alpha(\cdot)$. Combined with the previous results
these computations make it possible to evaluate the performance of specific policies as well
as the optimal policy.

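As generic data for the branching bandit process, we assume that the joint distribution
of $v_i, (N_{ij})_{j \in E}$ is given by the transform

$$
\Phi_i(\theta, z_1, \ldots, z_n) = E\bigl[ e^{-\theta v_i} z_1^{N_{i1}} \cdots z_n^{N_{in}} \bigr]. \tag{62}
$$

In addition, we have already introduced the generating function of the marginal distribution
of $v_i$ (whose distribution function is denoted $G_i(\cdot)$):

$$
\Psi_i(\theta) = E\bigl[ e^{-\theta v_i} \bigr] = \int_0^\infty e^{-\theta t}\, dG_i(t). \tag{63}
$$

Finally, the vector $m = (m_1, \ldots, m_n)$ of jobs initially present is given.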

As we saw in the previous section the duration of an $(i, S)$-period, $T_i^S$, plays a crucial
role. We will compute its moment generating function

$$
\Psi_i^S(\theta) = E\bigl[ e^{-\theta T_i^S} \bigr]. \tag{64}
$$

For this reason we decompose the duration of an $(i, S)$-period as a sum of independent
random variables as follows:

$$
T_i^S = v_i + \sum_{j \in S} \sum_{k=1}^{N_{ij}} T_{j,k}^S, \tag{65}
$$



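where $v_i$, $\{T_{j,k}^S\}_{k \ge 1}$ are independent. Therefore,

$$
\begin{aligned}
\Psi_i^S(\theta) &= E\Bigl[ e^{-\theta \bigl( v_i + \sum_{j \in S} \sum_{k=1}^{N_{ij}} T_{j,k}^S \bigr)} \Bigr] \\
&= E\Bigl[ e^{-\theta v_i} \prod_{j \in S} \bigl( \Psi_j^S(\theta) \bigr)^{N_{ij}} \Bigr] \\
&= \Phi_i\bigl( \theta, (\Psi_j^S(\theta))_{j \in S}, 1_{S^c} \bigr), \qquad i \in E.
\end{aligned} \tag{66}
$$

Given S, fixed point system (66) provides a way to compute the values of $\Psi_i^S(\theta)$, for $i \in E$.
We now have the elements to prove the following result: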

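Proposition 6 (Computation of $A_\alpha$ and $b_\alpha(\cdot)$) For a branching bandit process, matrix
$A_\alpha$ and set function $b_\alpha(\cdot)$ satisfy the following relations:

$$
A_{i,\alpha}^S = \frac{1 - \Psi_i^{S^c}(\alpha)}{1 - \Psi_i(\alpha)}, \qquad i \in S, \quad S \subseteq E, \tag{67}
$$

$$
b_\alpha(S) = \frac{1}{\alpha} \prod_{j \in S^c} \bigl[ \Psi_j^{S^c}(\alpha) \bigr]^{m_j} - \frac{1}{\alpha} \prod_{j \in E} \bigl[ \Psi_j^E(\alpha) \bigr]^{m_j}, \qquad S \subseteq E. \tag{68}
$$

Proof

Relation (67) follows directly from the definition of $A_{i,\alpha}^S$. On the other hand, we have

$$
T_m^S = \sum_{i \in S} \sum_{k=1}^{m_i} T_{i,k}^S. \tag{69}
$$

Hence,

$$
E\Bigl[ \int_{T_m^{S^c}}^{T_m^E} e^{-\alpha t}\, dt \Bigr]
= \frac{1}{\alpha} E\bigl[ e^{-\alpha T_m^{S^c}} \bigr] - \frac{1}{\alpha} E\bigl[ e^{-\alpha T_m^E} \bigr]
= \frac{1}{\alpha} \prod_{j \in S^c} \bigl[ \Psi_j^{S^c}(\alpha) \bigr]^{m_j} - \frac{1}{\alpha} \prod_{j \in E} \bigl[ \Psi_j^E(\alpha) \bigr]^{m_j}. \tag{70}
$$

Therefore, from (37), (68) follows. □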
Remarks:

1.  Note that $A_{i,\alpha}^E = 1$, for $i \in E$, and $b_\alpha(E) = \frac{1}{\alpha} - \frac{1}{\alpha} \prod_{j \in E} [\Psi_j^E(\alpha)]^{m_j}$.

2.  From Proposition 6 we can compute matrix $A_\alpha$ and set function $b_\alpha(\cdot)$ provided we
can solve system (66). As an example, we illustrate the form of the equations in the
special case in which the type j jobs that arrive during the time that we work on a
type i job form a Poisson process with rate $\lambda_{ij}$, i.e.,

$$
\Phi_i(\alpha, z_1, \ldots, z_n) = E\Bigl[ e^{-v_i \bigl( \alpha + \sum_{j \in E} \lambda_{ij} (1 - z_j) \bigr)} \Bigr] = \Psi_i\Bigl( \alpha + \sum_{j \in E} \lambda_{ij} (1 - z_j) \Bigr).
$$

In this case, (66) becomes

$$
\Psi_i^S(\alpha) = \Psi_i\Bigl( \alpha + \sum_{j \in S} \lambda_{ij} \bigl[ 1 - \Psi_j^S(\alpha) \bigr] \Bigr), \qquad i \in E. \tag{71}
$$

As a result, an algorithm to compute $\Psi_i^S(\alpha)$ is as follows:

(1)  Find a fixed point for the system of nonlinear equations (71) in terms of $\Psi_i^S(\alpha)$.
Although in general (71) might not have a closed form solution, in special cases (e.g.,
exponential) a closed form solution could be obtained.

(2)  From Proposition 6 compute $(A_\alpha, b_\alpha)$ in terms of $\Psi_i^S(\alpha)$.
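
Step (1) can be sketched with a plain successive-substitution scheme. The code below is only an illustration under stated assumptions: the function name `busy_period_transforms`, the starting point zero and the fixed iteration count are choices made here, not part of the paper. It iterates the map in (71) and, for a single class with exponential service, compares the result with the root of the quadratic that (71) reduces to in that case.

```python
import math


def busy_period_transforms(alpha, lam, psi, S, iters=2000):
    """Iterate Psi_i^S = psi[i](alpha + sum_{j in S} lam[i][j] * (1 - Psi_j^S)).

    lam[i][j] is the Poisson rate of type j arrivals while a type i job is
    in service, and psi[i] is the Laplace transform of the type i service time.
    """
    n = len(psi)
    x = [0.0] * n  # start from zero; the iterates then increase monotonically
    for _ in range(iters):
        x = [psi[i](alpha + sum(lam[i][j] * (1.0 - x[j]) for j in S))
             for i in range(n)]
    return x


# Single class with exponential(mu) service and Poisson(lam) arrivals.
lam, mu, alpha = 1.0, 2.0, 0.5
x = busy_period_transforms(alpha, [[lam]], [lambda s: mu / (mu + s)], S=[0])

# In this special case the fixed point is the smaller root of
# lam * x^2 - (lam + mu + alpha) * x + mu = 0.
s = lam + mu + alpha
closed_form = (s - math.sqrt(s * s - 4.0 * lam * mu)) / (2.0 * lam)
print(x[0], closed_form)
```

Starting from zero makes the iteration converge to the smallest fixed point, which is the one lying in the unit interval. Step (2) then only requires substituting these values into the relations of Proposition 6.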
3.2      Undiscounted Branching Bandits

In this section we address branching bandits with no discounts. Clearly, in the case of
pure rewards the problem is trivial, since all policies have the same reward. Under a
linear undiscounted tax structure on the branching bandit process, however, the problem
becomes interesting. Indeed, since an optimal policy under the time average holding cost
criterion also minimizes the expected total holding cost in each busy period (see Nain et al.
[24]), modelling and solving undiscounted branching bandits leads to the solution of several
classical queueing scheduling problems.

More importantly, our approach reveals rigorously the connections of discounted and
undiscounted problems, which, in our opinion, have not been thoroughly addressed in the
literature. To give a concrete example: after solving an indexable discounted scheduling
problem, researchers say that the same ordering of the jobs holds for the undiscounted
problem as the discount factor $\alpha \searrow 0$, provided there are no ties of the corresponding
indices. It is not clear, however, what happens when there are ties.

We will introduce in this subsection a performance measure $z^u$ for a branching bandit
process that satisfies generalized conservation laws. It is appropriate for modelling a linear
undiscounted tax structure on the branching bandit process. We shall assume in the
following development that all the expectations that appear are finite. We will show later
necessary and sufficient conditions for this assumption to hold. Using the indicator

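$$
I_i(t) = \begin{cases} 1, & \text{if a type } i \text{ job is being engaged at time } t; \\ 0, & \text{otherwise,} \end{cases}
$$

we introduced earlier, we let

$$
z_i^u = E_u\Bigl[ \int_0^\infty I_i(t)\, t\, dt \Bigr], \qquad i \in E. \tag{72}
$$

Let us define

$$
A_i^S = \frac{E[T_i^{S^c}]}{E[v_i]}, \qquad i \in S, \tag{73}
$$

and

$$
b(S) = \frac{1}{2} E[(T_m^E)^2] - \frac{1}{2} E[(T_m^{S^c})^2] + \sum_{i \in S} b_i(S), \tag{74}
$$

where

$$
b_i(S) = \frac{E[\nu_i]\, E[v_i^2]}{2} \Bigl( \frac{E[T_i^{S^c}]}{E[v_i]} - \frac{E[(T_i^{S^c})^2]}{E[v_i^2]} \Bigr). \tag{75}
$$

3.2.1      Generalized Conservation Laws
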

We prove next that the performance measure for a branching bandit process defined in (72)
satisfies generalized conservation laws. The main result is the following:

Theorem 10 (Generalized Conservation Laws for Undiscounted Branching Bandits)

The performance vector for branching bandits $z^u$ satisfies generalized conservation laws (26)
and (27) associated with matrix A and set function $b(\cdot)$.

Proof 

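Let $S \subseteq E$. Let us assume that jobs are selected under an admissible policy u. This generates
a branching bandit process. Let us define two random vectors, $(r_i^u)_{i \in E}$ and $(r_i^{u,S})_{i \in S}$,
as functions of the sample path as follows:

$$
r_i^u = \int_0^\infty I_i(t)\, t\, dt = \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + v_{i,k}} t\, dt = \sum_{k=1}^{\nu_i} \Bigl( v_{i,k}\, \tau_{i,k} + \frac{v_{i,k}^2}{2} \Bigr), \qquad i \in E, \tag{76}
$$

and

$$
r_i^{u,S} = \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} t\, dt, \qquad i \in S. \tag{77}
$$

Now, we have

$$
\begin{aligned}
z_i^u = E_u[r_i^u] &= E_u\Bigl[ \sum_{k=1}^{\nu_i} \Bigl( v_{i,k}\, \tau_{i,k} + \frac{v_{i,k}^2}{2} \Bigr) \Bigr] \\
&= E[v_i]\, E_u\Bigl[ \sum_{k=1}^{\nu_i} \tau_{i,k} \Bigr] + \frac{E[\nu_i]\, E[v_i^2]}{2}.
\end{aligned} \tag{79}
$$

Note that equality (78) holds because, since u is nonanticipative, $\tau_{i,k}$ and $v_{i,k}$ are independent
random variables. On the other hand, we have

$$
\begin{aligned}
E_u[r_i^{u,S}] &= E_u\Bigl[ \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} t\, dt \Bigr] \\
&= E_u\Bigl[ \sum_{k=1}^{\nu_i} \Bigl( T_{i,k}^{S^c}\, \tau_{i,k} + \frac{(T_{i,k}^{S^c})^2}{2} \Bigr) \Bigr] \\
&= E[T_i^{S^c}]\, E_u\Bigl[ \sum_{k=1}^{\nu_i} \tau_{i,k} \Bigr] + \frac{E[\nu_i]\, E[(T_i^{S^c})^2]}{2}.
\end{aligned} \tag{80}
$$

Note that equality (80) holds because, since policy u is nonanticipative, $\tau_{i,k}$ and $T_{i,k}^{S^c}$ are
independent random variables. Hence, by (79) and (80):

$$
\frac{z_i^u - \frac{1}{2} E[\nu_i]\, E[v_i^2]}{E[v_i]} = \frac{E_u[r_i^{u,S}] - \frac{1}{2} E[\nu_i]\, E[(T_i^{S^c})^2]}{E[T_i^{S^c}]}, \tag{81}
$$

and thus we obtain:

$$
A_i^S z_i^u = E_u[r_i^{u,S}] + b_i(S), \qquad i \in S. \tag{82}
$$
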
We will first show that generalized conservation law (26) holds. Consider a policy $\pi$ that
gives complete priority to $S^c$-jobs. Applying Proposition 4(a), we obtain:

$$
\int_0^{T_m^E} t\, dt = \int_0^{T_m^{S^c}} t\, dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} t\, dt = \frac{(T_m^{S^c})^2}{2} + \sum_{i \in S} r_i^{\pi,S}. \tag{83}
$$

Hence, taking expectations and using equation (82) and the definition of $b(S)$ we obtain

$$
\sum_{i \in S} A_i^S z_i^\pi = b(S),
$$

which proves that generalized conservation law (26) holds.

We next show that generalized conservation law (27) is satisfied. Let the jobs be selected
under admissible policy u. Then, Proposition 4 (part (b)) applies, and we can write


^0  Jo  f^^  ^tl  Jr,,, 


On  the  other  hand,  we  have 


E^"-'  =  Ztj" 

.G5  ■€>"  i=l       •  * 

>  ZE/ 

=      /        I  dt  -  tdi 

Jo  Jo 


idt  (8o) 

(86) 


>      /       Idt  -  tdi.  (87) 

^0  Jo 

Notice that (85) follows by Proposition 4 (part (c)), (86) follows by (84), and (87) by
Proposition 4 (part (c)). Hence, taking expectations in (87), and applying (82) we obtain

$$
\begin{aligned}
\sum_{i \in S} A_i^S z_i^u &= E_u\Bigl[ \sum_{i \in S} r_i^{u,S} \Bigr] + \sum_{i \in S} b_i(S) \\
&\ge E\Bigl[ \int_0^{T_m^E} t\, dt \Bigr] - E\Bigl[ \int_0^{T_m^{S^c}} t\, dt \Bigr] + \sum_{i \in S} b_i(S) \\
&= b(S),
\end{aligned} \tag{88}
$$

which proves that generalized conservation law (27) holds, and this completes the proof of
the theorem. □

Corollary 3 The performance space for branching bandits corresponding to the performance
vector $z^u$ is the extended polymatroid base $B(A, b)$; furthermore, the vertices of
$B(A, b)$ are the performance vectors corresponding to the fixed priority rules.

3.2.2      The  Undiscounted  Tax  Problem 

Let us associate with a branching bandit process the following linear tax structure: A
holding tax $C_i$ per unit time is incurred continuously during the stay of a type i job in the
system. Let us denote

$V_u^{(0,C)}(m)$ = expected total tax incurred under policy u, given that there are initially $m_i$
type i jobs in the system, for $i \in E$.

The tax problem is the following optimal control problem: find an admissible policy $u^*$ that
minimizes $V_u^{(0,C)}(m)$ over all admissible policies u. In this section we find a closed formula
for $V_u^{(0,C)}(m)$ and show how to solve the problem using algorithm $A_1$. For that purpose
we need some preliminary results:

Expected  System  Times 

Let $Q_j^u(\cdot)$, $I_j(\cdot)$ and $x_j^u(\cdot)$ be as in Subsection 3.1. By definition we have

$$
Q_j^u(0) = \int_0^\infty E_u[Q_j(t) \mid Q(0) = m]\, dt, \qquad j \in E,
$$

and

$$
x_j^u(0) = E_u\Bigl[ \int_0^\infty I_j(t)\, dt \Bigm| Q(0) = m \Bigr], \qquad j \in E.
$$

From the above formulas, it is clear that

1.  $Q_j^u(0)$ is the expected total time spent in the system by type j jobs under policy u.

2.  $x_j^u(0)$ is the expected total time spent working on type j jobs under policy u. Clearly,
$x_j^u(0)$ does not depend on the policy u. Hence, we shall write $x_j(0) = x_j^u(0)$.

Now, letting $\alpha \searrow 0$ in equation (60) we obtain


where 


^'"^°^  = "  efi  ^"""^''^^  ^  ^  ^i^'""- *'^°^ + ''■'■  •''  ^  ^' 


^^=^'-  21^5)  -^(0)  -  E  E[^u]  (1  -  ^)  -.(0) 


t€£ 


2E[i., 


189) 


(90) 


and $(x_j^u)'(0)$ denotes the right derivative of $x_j^u(\alpha)$ at $\alpha = 0$, that is:

$$
(x_j^u)'(0) = -E_u\Bigl[ \int_0^\infty I_j(t)\, t\, dt \Bigr].
$$


Hence,  we  have 


q: 


r(o)  =  Fi^--;-E  !¥-'."+''.■  j^^ 


E[^'.]  .,£ 


E[r,] 


(92) 


Modelling  and  Solution  of  the  Tax  Problem 

We have, by (92),


Vl"'^\rn}     =     ^C.Q-"(0) 


■  6E 


=  .Sl^^-^tF^l-S-- 


(93) 


From equation (93) it is straightforward to apply the results of Section 2 to solve the
optimal control problem: use algorithm $A_1$ with input $(R, A)$, where

$$
R_i = \frac{C_i - \sum_{j \in E} E[N_{ij}]\, C_j}{E[v_i]}. \tag{94}
$$

Let $\gamma_1, \ldots, \gamma_n$ be the corresponding generalized Gittins indices. Then we have the result

Theorem 11 (Optimality and Indexability: Undiscounted Branching Bandits) (a)
Algorithm $A_1$ provides an optimal policy for the undiscounted tax branching bandit problem;
(b) An optimal policy is to work at each decision epoch on a project with largest current
index $\gamma_i$.


3.2.3      Computation of A and $b(\cdot)$

In this section we compute the matrix A and the set function $b(\cdot)$ as follows. Recall that

$$
A_i^S = \frac{E[T_i^{S^c}]}{E[v_i]}, \qquad i \in S,
$$

and

$$
b(S) = \frac{1}{2} E[(T_m^E)^2] - \frac{1}{2} E[(T_m^{S^c})^2] + \sum_{i \in S} b_i(S).
$$

From equation (65) we obtain, taking expectations:

$$
E[T_i^S] = E[v_i] + \sum_{j \in S} E[N_{ij}]\, E[T_j^S], \qquad i \in S. \tag{95}
$$

Solving this linear system we obtain $E[T_i^S]$. Note that the computation of $A_i^S$ is much
easier in the undiscounted case compared with the discounted case, where we had to solve
a system of nonlinear equations. Also, applying the conditional variance formula to (65) we
obtain:

$$
\mathrm{Var}[T_i^S] = \mathrm{Var}[v_i] + (E[T_j^S])_{j \in S}^T\, \mathrm{Cov}[(N_{ij})_{j \in S}]\, (E[T_j^S])_{j \in S} + \sum_{j \in S} E[N_{ij}]\, \mathrm{Var}[T_j^S], \qquad i \in S. \tag{96}
$$

Solving this linear system we obtain $\mathrm{Var}[T_i^S]$ and thus $E[(T_i^S)^2]$. Moreover,

$$
E[\nu_j] = m_j + \sum_{i \in E} E[N_{ij}]\, E[\nu_i], \qquad j \in E. \tag{97}
$$

Finally, from equation (69) we obtain

$$
E[T_m^S] = \sum_{i \in S} m_i\, E[T_i^S] \tag{98}
$$

and

$$
\mathrm{Var}[T_m^S] = \sum_{i \in S} m_i\, \mathrm{Var}[T_i^S]. \tag{99}
$$

3.2.4      Stability condition


We investigate in this section under what conditions the linear systems (95) and (96) have
a positive solution for all sets $S \subseteq E$. In this way we can address the stability of a branching
bandits process, in the sense that the first two moments of a busy period of a branching
bandit process are finite. Let N denote the matrix of $E[N_{ij}]$.

Theorem 12 (Stability of branching bandits) The branching bandits process is stable
if and only if the matrix $I - N$ is positive definite.

Proof 

Suppose $I - N$ is positive definite. We will show the system is stable. System (95) can be
written in vector notation as follows:

$$
(I - N)_S\, T_S = v_S, \tag{100}
$$

where $T_S = (E[T_i^S])_{i \in S}$. Solving the system using Cramer's rule and expanding the determinant
in the numerator along the column $v_S$ we obtain:

where the coefficients of the expansion are nonnegative numbers (which are determinants themselves). If $I - N$ is positive
definite, then $\det[(I - N)_S] > 0$ for all $S \subseteq E$ and thus system (95) has a solution $E[T_i^S] > 0$
for all $i \in S$ and $S \subseteq E$. Similarly, (96) can be written as

$$
(I - N)_S\, x_S = u_S,
$$

where $x_S = (\mathrm{Var}[T_i^S])_{i \in S}$ and $u_S \ge 0$. Therefore, using the same argument it follows that
if $I - N$ is positive definite, then $\mathrm{Var}[T_i^S] \ge 0$. Hence, from (98) and (99) we obtain that
the first two moments of the busy periods are finite, i.e., the system is stable.

Conversely, if the system is stable, we will show that $I - N$ is positive definite. Since
the system is stable for all initial vectors m, it follows that $E[T_i^S]$ have finite nonnegative
values for all $i \in S$ and $S \subseteq E$, i.e., system (100) has a positive solution for all $S \subseteq E$.
We will show by induction on $|S|$ that $\det[(I - N)_S] > 0$ for all $S \subseteq E$. For $|S| = 1$,
$E[T_i^{\{i\}}] = E[v_i] / \det[(I - N)_{\{i\}}] > 0$, which implies that $\det[(I - N)_{\{i\}}] > 0$. Assuming that the induction
hypothesis is true for $|S| = k$, we use (101) to obtain:

from the induction hypothesis. Therefore, $I - N$ is positive definite. □

Note that the condition $N < I$ ($I - N$ positive definite) naturally generalizes the stability
condition $\rho < 1$ in queueing systems as follows: If we interpret a queueing system as a
branching bandit then $N < I$ translates to $E[N] = \rho = \lambda E[v] < 1$, since N is the number
of customers that arrive (at a rate $\lambda$) during the service time v of a customer.
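
The linear system (95) is easy to solve numerically. The sketch below is an illustration only: the helper names `solve` and `mean_period_lengths` are hypothetical, and the example assumes a two-class queue without feedback, for which $E[N_{ij}] = \lambda_j E[v_i]$. In that case the solution for $S = E$ should reduce to the familiar busy period formula $E[v_i]/(1 - \rho)$, which is finite and positive exactly when $\rho < 1$.

```python
def solve(M, rhs):
    """Solve M y = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(M, rhs)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(n):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * w for u, w in zip(a[r], a[c])]
    return [a[r][n] / a[r][r] for r in range(n)]


def mean_period_lengths(EN, Ev, S):
    """Solve (95): E[T_i^S] = E[v_i] + sum_{j in S} E[N_ij] E[T_j^S], i in S."""
    M = [[(1.0 if i == j else 0.0) - EN[i][j] for j in S] for i in S]
    return solve(M, [Ev[i] for i in S])


# Two-class queue without feedback (assumed example data).
lam = [0.2, 0.3]
Ev = [1.0, 2.0]
EN = [[lam[j] * Ev[i] for j in range(2)] for i in range(2)]
T = mean_period_lengths(EN, Ev, S=[0, 1])
rho = sum(l * v for l, v in zip(lam, Ev))  # traffic intensity
print(T, rho)
```

The same routine applied to each subset S yields the quantities $E[T_i^S]$ needed for the matrix A; a dense solver is used here only for brevity.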

3.3     Relation  between  Discounted  and  Undiscounted  Tax  Problem 

In this subsection we study the asymptotic behaviour of the optimal policies in the discounted
tax problem as the discount factor $\alpha$ approaches 0, and its relation with the undiscounted
tax problem, that corresponds to $\alpha$ equal to 0. It is easy to see, using the
notation of Subsections 3.1 and 3.2, that

$$
\lim_{\alpha \searrow 0} A_{i,\alpha}^S = A_i^S, \tag{102}
$$

and

$$
\lim_{\alpha \searrow 0} \alpha R_{\alpha,i} = \lim_{\alpha \searrow 0} \Bigl\{ C_i - \sum_{j \in E} E[N_{ij}]\, C_j \Bigr\} \frac{\alpha\, \Psi_i(\alpha)}{1 - \Psi_i(\alpha)} = \frac{C_i - \sum_{j \in E} E[N_{ij}]\, C_j}{E[v_i]} = R_i. \tag{103}
$$

Therefore, because of the structure of the generalized Gittins indices (see Proposition 3) it
follows from (102) and (103) that the generalized Gittins indices of the undiscounted and
discounted tax problem are related as follows:

$$
\lim_{\alpha \searrow 0} \alpha\, \gamma_i(\alpha) = \gamma_i. \tag{104}
$$

A consequence of (104) is that a policy which is asymptotically optimal in the discounted
tax problem for $\alpha \searrow 0$ will be optimal for the undiscounted problem.
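
The limits above rest on a first-order expansion of the transforms near $\alpha = 0$. As a worked step, assuming the first moments $E[v_i]$ and $E[T_i^{S^c}]$ are finite and positive:

```latex
1 - \Psi_i(\alpha) = 1 - E\bigl[e^{-\alpha v_i}\bigr] = \alpha\, E[v_i] + o(\alpha),
\qquad
1 - \Psi_i^{S^c}(\alpha) = \alpha\, E[T_i^{S^c}] + o(\alpha)
\qquad (\alpha \searrow 0),

\text{so that}\qquad
\frac{\alpha\, \Psi_i(\alpha)}{1 - \Psi_i(\alpha)} \longrightarrow \frac{1}{E[v_i]},
\qquad
\frac{1 - \Psi_i^{S^c}(\alpha)}{1 - \Psi_i(\alpha)} \longrightarrow \frac{E[T_i^{S^c}]}{E[v_i]}.
```

For an exponential service time with rate $\mu_i$ the first ratio equals $\mu_i = 1/E[v_i]$ for every $\alpha > 0$, so the limit is attained exactly.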

4     Applications 

In this section we apply the previous theory to several classical stochastic scheduling problems.

4.1      The Multi-armed Bandit Problem

The multi-armed bandit problem was defined in the introduction.

There are K parallel projects, indexed $k = 1, \ldots, K$. Project k can be in one of a finite
number of states $i_k \in E_k$. At each instant of discrete time $t = 0, 1, \ldots$ one can work on
only a single project. If one works on project k in state $i_k(t)$ at time t then one receives
an immediate expected reward of $R_{i_k(t)}$. Rewards are additive and discounted in time by a
factor $\beta$. The state $i_k(t)$ changes to $i_k(t+1)$ by a Markov transition rule (which may depend
on k, but not on t), while the states of the projects one has not engaged remain unchanged,
i.e., $i_l(t+1) = i_l(t)$ for $l \neq k$. Let $P^k = (p_{ij}^k)_{i,j \in E_k}$ be the matrix of Markov transition
probabilities corresponding to project k. The problem is how to allocate one's resources to
projects sequentially in time in order to maximize expected total discounted reward over
an infinite horizon. That is, if $j(t)$ denotes the state of the project engaged at time t, the
goal is to find a nonidling and nonanticipative scheduling policy u that maximizes

$$
E_u\Bigl[ \sum_{t=0}^\infty \beta^t R_{j(t)} \Bigr]. \tag{105}
$$

We model the problem as a branching bandits problem in order to apply the results
of the previous section. For this reason we set $e^{-\alpha} = \beta$, $v_i = 1$. We also define matrix
$P = (p_{ij})_{i,j \in E}$ by

$$
p_{ij} = \begin{cases} p_{ij}^k, & \text{if } i, j \in E_k \text{ for some } k = 1, \ldots, K; \\ 0, & \text{otherwise.} \end{cases}
$$

Moreover, by (62) we obtain:

$$
\Phi_i(\alpha, z_1, \ldots, z_n) = E\bigl[ e^{-\alpha} z_1^{N_{i1}} \cdots z_n^{N_{in}} \bigr] = \beta \sum_{j \in E} p_{ij} z_j, \qquad \text{for } i \in E, \tag{106}
$$


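and, by (66),

$$
\Psi_i^S(\alpha) = \Phi_i\bigl( \alpha, (\Psi_j^S(\alpha))_{j \in S}, 1_{S^c} \bigr) = \beta \Bigl\{ 1 - \sum_{j \in S} p_{ij} \bigl( 1 - \Psi_j^S(\alpha) \bigr) \Bigr\}, \qquad \text{for } i \in E. \tag{107}
$$

By introducing

$$
t_i^S = \frac{1 - \Psi_i^S(\alpha)}{1 - \Psi_i(\alpha)}
$$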


and noticing that since $v_i = 1$, $\Psi_i(\alpha) = \beta$, it follows from (107) that

$$
t_i^S = 1 + \beta \sum_{j \in S} p_{ij}\, t_j^S, \qquad i \in E, \tag{108}
$$

and by (107) and Proposition 6 we obtain
Moreover, since $\Psi_j^E(\alpha) = 0$,

where


1,     if at time t = 0 there is a bandit in state j;
0,     otherwise.
The structure of the matrix $P = (p_{ij})$ implies that

which implies that the index decomposition condition (17) holds, and therefore Theorem 3
applies, giving a new proof of Gittins theorem:

Theorem 13 (Gittins and Jones [14]) For each project k there exist indices $\{\gamma_i^k\}_{i \in E_k}$,
depending only on characteristics of project k, such that an optimal policy is to engage at
each time a project with largest current index.

By the results of Subsection 3.1.3 we know that the generalized Gittins indices for this
bandit problem coincide with the usual Gittins indices. Further, by definition of generalized
Gittins indices, we obtain a characterization of Gittins indices as sums of dual variables, a
purely algebraic characterization. Also, note that Theorem 13 implies that the multi-armed
bandit problem not only has an optimal index policy, but it has an optimal index policy
which satisfies the stronger index decomposition property, as described in Subsection 2.4.
By Theorem 6, the Gittins indices can be computed by solving K subproblems, applying
algorithm $A_1$ to subproblem k, with $|E_k|$ job types, $k = 1, \ldots, K$. It is easy to verify the
following complexity result:

Proposition 7 The complexity of algorithm $A_1$ applied to subproblem k for computing the
Gittins indices of project k is

$$
O(|E_k|^3).
$$

The algorithm proposed by Varaiya, Walrand and Buyukkoc [33] has the same time complexity
as algorithm $A_1$. In fact, both algorithms are closely related, as we will see next.
Let $t_i^S$ be as given by (108). Let $r_i^S$ be given by

$$
r_i^S = R_i + \beta \sum_{j \in S} p_{ij}\, r_j^S, \qquad i \in S.
$$

Let us now state the algorithm of Varaiya, Walrand and Buyukkoc:
Algorithm VWB:

Step 0.   Pick $\pi_n \in \operatorname{argmax} \{ r_i^{\{i\}} / t_i^{\{i\}} : i \in E \}$; let $g_{\pi_n} = \max \{ r_i^{\{i\}} / t_i^{\{i\}} : i \in E \}$;
set $J_n = \{\pi_n\}$.

Step k.   For $k = 1, \ldots, n - 1$:
pick $\pi_{n-k} \in \operatorname{argmax} \{ r_i^{J_{n-k+1} \cup \{i\}} / t_i^{J_{n-k+1} \cup \{i\}} : i \in E \setminus J_{n-k+1} \}$; set $g_{\pi_{n-k}} = \max \{ r_i^{J_{n-k+1} \cup \{i\}} / t_i^{J_{n-k+1} \cup \{i\}} : i \in E \setminus J_{n-k+1} \}$;
set $J_{n-k} = J_{n-k+1} \cup \{\pi_{n-k}\}$.
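
A largest-index-first computation of this kind can be sketched in a few lines. The code below is an illustrative reading of the scheme, not the authors' implementation: it assumes a single Markov project with rewards `R`, transition matrix `P` and discount factor `beta < 1`, takes the continuation set for a candidate state `i` to be the already ranked states together with `i`, and solves the linear systems for `r` and `t` with a small dense solver. The names `solve` and `gittins_indices` are hypothetical, and the repeated dense solves make this slower than the cubic bound of Proposition 7.

```python
def solve(M, rhs):
    """Solve M y = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(M, rhs)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(n):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * w for u, w in zip(a[r], a[c])]
    return [a[r][n] / a[r][r] for r in range(n)]


def gittins_indices(R, P, beta):
    """Rank states from highest to lowest index; return the list of indices."""
    n = len(R)
    J = []           # states already ranked, highest index first
    g = [None] * n
    while len(J) < n:
        best, best_ratio = None, None
        for i in range(n):
            if i in J:
                continue
            S = J + [i]  # continuation set: ranked states plus candidate i
            M = [[(1.0 if a == b else 0.0) - beta * P[a][b] for b in S]
                 for a in S]
            r = solve(M, [R[a] for a in S])   # r_a = R_a + beta * sum_b p_ab r_b
            t = solve(M, [1.0] * len(S))      # t_a = 1 + beta * sum_b p_ab t_b
            ratio = r[-1] / t[-1]             # candidate i is the last entry of S
            if best is None or ratio > best_ratio:
                best, best_ratio = i, ratio
        g[best] = best_ratio
        J.append(best)
    return g


# Two-state project: state 0 pays 1 forever; state 1 pays 0, then moves to state 0.
g = gittins_indices([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], 0.5)
print(g)
```

In this small example state 1 earns nothing now and one unit per period afterwards, so its index equals the discount factor.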

Varaiya et al. [33] proved that $g_1, \ldots, g_n$, as given by algorithm VWB, are the Gittins
indices of the multi-armed bandit problem. Let $(\pi, y, \nu, S)$ be an output of algorithm $A_1$.
We state, without proof, the following relation between algorithms $A_1$ and VWB.

Proposition 8 The following relations hold: For $j = 2, \ldots, n$


U} i^nlUft}  D  ,--n         .{t, 7r,]  {tTj jr„} 


111) 


and 

r<'^         ft 

^-'=Q,      i£E.  (112) 

and therefore, algorithms $A_1$ and VWB are equivalent.

4.2     Scheduling  Control  of  a  Multiclass  Queue  with  Bernoulli  Feedback 

Klimov [22] introduced the following queueing scheduling process: There is a single server
and n customer types. External arrivals of type i customers form a Poisson process of rate $\lambda_i$,
for $i \in E = \{1, \ldots, n\}$. Service times for type i customers are independent and identically
distributed as a random variable $v_i$ with distribution function $G_i(\cdot)$. When service of a
type i customer is completed, the customer either joins the queue of type j customers, with
probability $p_{ij}$ (thus becoming a type j customer), or with probability $1 - \sum_{j \in E} p_{ij}$ leaves the
system. The server selects the jobs according to an admissible policy u; the decision epochs
are $t = 0$ (if there is initially some customer present), the epochs when a customer arrives
to find the system empty and the epochs when a customer completes service (and some
customer remains in the system). Let us consider the following three classes of admissible
policies: $U$ is the class of all nonidling, nonpreemptive and nonanticipative policies; $U_0$ is
the class of all nonpreemptive and nonanticipative policies (idleness is allowed); and $U_p$ is
the class of all nonidling and nonanticipative policies (preemption is allowed).

Klimov [22] solved, by direct methods, the associated optimal control problem over $U$
with a time-average holding cost criterion. Harrison [18] solved, using dynamic programming,
the optimal control problem over $U_0$ with a discounted reward-cost criterion, in the
special case that there is no feedback. Tcha and Pliska [29] extended Harrison's results to
the case with feedback. They also solved the control problem over $U_p$, in the case that the
service times are exponential.


The  Discounted  Case 

Let us consider the following reward-cost structure: There is a continuous holding cost $C_i$
per unit time for each type i customer staying in the system, and an instantaneous reward
of $R_i$ at the epoch of completion of service of a type i customer. There is also an
instantaneous reward of idleness $R_0$ at the end of an idle period. All costs and rewards are
discounted in time by a discount factor $\alpha > 0$. The optimal control problem is to find an
admissible policy to schedule the server so as to maximize the expected total discounted
reward minus holding cost over an infinite horizon. Let us denote $P_U$, $P_{U_0}$ and $P_{U_p}$ the
optimal control problems corresponding to the classes of admissible policies $U$, $U_0$ and $U_p$,
respectively. We will model each of these problems as a branching bandit problem. We will
also prove, applying the Index Decomposition Theorem, that in order to solve problem $P_{U_0}$
we only need to solve problem $P_U$.

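First, let us consider problem $P_U$. This problem can be modelled as a branching bandit
problem with n job types, as follows: We interpret the customers as jobs. The descendants
$N_{ij}$ of a type i job are composed of the transition of the job to another type (or outside the
system) and of the external Poisson arrivals. The transform $\Phi_i(\cdot)$ is given by

$$
\begin{aligned}
\Phi_i(\alpha, z_1, \ldots, z_n) &= E\bigl[ e^{-\alpha v_i} z_1^{N_{i1}} \cdots z_n^{N_{in}} \bigr] \\
&= E\Bigl[ \Bigl( 1 - \sum_{j \in E} p_{ij} (1 - z_j) \Bigr) e^{-\bigl( \alpha + \sum_{j \in E} \lambda_j (1 - z_j) \bigr) v_i} \Bigr], \qquad i \in E.
\end{aligned} \tag{113}
$$

Also, by (66) and (113),

$$
\Psi_i^S(\alpha) = \Bigl\{ 1 - \sum_{j \in S} p_{ij} \bigl( 1 - \Psi_j^S(\alpha) \bigr) \Bigr\}\, \Psi_i\Bigl[ \alpha + \sum_{j \in S} \lambda_j \bigl( 1 - \Psi_j^S(\alpha) \bigr) \Bigr], \qquad i \in E. \tag{114}
$$

Let $x^u(\alpha) = (x_1^u(\alpha), \ldots, x_n^u(\alpha))^T$ denote the performance vector, as in Section 3.1. We know
that $x^u(\alpha)$ satisfies generalized conservation laws. By Proposition 6, the corresponding
matrix $A_\alpha$ is given by

$$
A_{i,\alpha}^S = \frac{1 - \Psi_i^{S^c}(\alpha)}{1 - \Psi_i(\alpha)}, \qquad i \in S.
$$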
Let us consider now problem $P_{U_0}$. In order to model the option of idleness, we modify the
previous branching bandit process by adding an idling job type, which we denote 0. The
duration of job type 0, $v_0$, is exponentially distributed with parameter $\lambda = \lambda_1 + \cdots + \lambda_n$
(since it models time until the next arrival); the $N_{ij}$, with $i, j \in E$, are as in the previous
case, $N_{00} = 0$, $N_{0i} = 0$ and $N_{i0} = 0$ for $i \in E$. It is easy to see that the corresponding
transform $\tilde{\Phi}_i(\cdot)$ satisfies

$$
\tilde{\Phi}_i(\alpha, z_0, z_1, \ldots, z_n) = \Phi_i(\alpha, z_1, \ldots, z_n), \qquad i \in E.
$$

Hence, it follows that

$$
\tilde{\Psi}_i^{S \cup \{0\}}(\alpha) = \Psi_i^S(\alpha), \qquad i \in E, \quad S \subseteq E,
$$

and

$$
\tilde{\Psi}_0^{S \cup \{0\}}(\alpha) = \tilde{\Psi}_0^{\{0\}}(\alpha), \qquad S \subseteq E.
$$

Consequently, we have, for $i \in S \subseteq E$ that

$$
\tilde{A}_{i,\alpha}^{S \cup \{0\}} = A_{i,\alpha}^S
$$

and

$$
\tilde{A}_{0,\alpha}^{S \cup \{0\}} = \tilde{A}_{0,\alpha}^{\{0\}}.
$$


Therefore,  condition  (17)  holds,  and  the  Index  Decomposition  Theorem  6  applies    \ow.  we 
have 

E[iV.,]  =  p„  +  A,E[i,], 

and 

^o(o)  = 


A  +  Q 

By  (56) 

*^<  a  J  1  -  *,(a) 

Hence  the  index  of  the  subsystem  composed  of  job  type  0  is  70  =  ^Ro  The  indices 
$\gamma_i$, for $i \in E$, are computed from algorithm $\mathcal{A}_1$ applied to problem P;^. Therefore, if
$\gamma_1 \le \cdots \le \gamma_{i^*-1} < \gamma_0 \le \gamma_{i^*} \le \cdots \le \gamma_n$, then an optimal policy is to serve customers of types
$i^*, \ldots, n$ with a fixed priority policy, giving highest priority to $n$, and never serve customer
types $1, \ldots, i^* - 1$. That is, the optimal policy is a modified static policy, as proved by
Harrison [18] and Tcha and Pliska [29].
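
The modified static policy can be read off mechanically once the indices are known. The sketch below assumes the indices $\gamma_i$ and $\gamma_0$ have already been computed elsewhere; the function name `modified_static_policy` and the tie-breaking rule (a type whose index equals $\gamma_0$ is served) are assumptions made for illustration.

```python
def modified_static_policy(gamma, gamma0):
    """Split job types into a served list (highest priority first) and a never-served list.

    gamma  : dict mapping job type -> index gamma_i (assumed precomputed)
    gamma0 : index of the idling job type 0
    Types whose index equals gamma0 are served (an illustrative tie-breaking choice).
    """
    served = sorted((i for i, g in gamma.items() if g >= gamma0),
                    key=lambda i: gamma[i], reverse=True)
    never_served = sorted(i for i, g in gamma.items() if g < gamma0)
    return served, never_served
```

With `gamma = {1: 0.2, 2: 0.5, 3: 0.9, 4: 1.4}` and `gamma0 = 0.6`, types 4 and 3 are served in that priority order, and types 1 and 2 are never served.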

The Preemptive Case

If preemption is allowed, then the decision epochs are the arrival epochs as well as the
departure epochs of customers. If the service time $v_i$ is exponential with rate $\mu_i$, for $i \in E$,
then it is easy to model the possibility of preemption: model the process as a branching
bandit process with $n$ job types. Job type $i$ has a duration $\tilde v_i$ exponentially distributed with
rate $\tilde\mu_i = \mu_i + \lambda$, where $\lambda = \lambda_1 + \cdots + \lambda_n$. As for the descendants of a type $i$ job, there are
three cases: (1) one descendant, of type $j$, with probability $(\mu_i/\tilde\mu_i)\, p_{ij}$ (corresponding to the case
that service of the type $i$ customer ends before any arrival occurs and the customer moves
to queue $j$); (2) two descendants, one of type $i$ and the other of type $j$, with probability
$\lambda_j/\tilde\mu_i$ (corresponding to the case that a type $j$ customer arrives before service of the type $i$
customer is completed); and (3) no descendants, with probability $(\mu_i/\tilde\mu_i)\bigl(1 - \sum_{j \in E} p_{ij}\bigr)$ (corresponding
to the case that service of the type $i$ customer ends before any arrival occurs, and
the customer moves out of the system).
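
The three cases follow from the usual race between independent exponential clocks: service wins with probability $\mu_i/(\mu_i + \lambda)$ and a type $j$ arrival wins with probability $\lambda_j/(\mu_i + \lambda)$. A minimal sketch that tabulates the resulting descendant law and lets one check that it sums to one is given below; the function name `preemptive_descendant_law` and the dictionary encoding of outcomes are illustrative assumptions.

```python
def preemptive_descendant_law(i, mu, lam, P):
    """Return (rate, outcomes) for a type-i job in the preemptive model.

    mu[i]   : exponential service rate of type i
    lam[j]  : Poisson arrival rate of type j
    P[i][j] : feedback probability from type i to type j
    outcomes maps a tuple of descendant types to its probability.
    """
    n = len(lam)
    total = mu[i] + sum(lam)          # rate of the shortened job duration
    served_first = mu[i] / total      # service ends before any arrival
    outcomes = {}
    for j in range(n):
        # Case (1): service completes and the customer feeds back to queue j.
        outcomes[(j,)] = outcomes.get((j,), 0.0) + served_first * P[i][j]
        # Case (2): a type-j arrival comes first; descendants are one type-i
        # job and one type-j job.
        key = tuple(sorted((i, j)))
        outcomes[key] = outcomes.get(key, 0.0) + lam[j] / total
    # Case (3): service completes and the customer leaves the system.
    outcomes[()] = served_first * (1.0 - sum(P[i]))
    return total, outcomes
```

Summing the values of `outcomes` gives $(\mu_i + \lambda)/\tilde\mu_i = 1$, which is a quick sanity check on the three probabilities.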

The  Undiscounted  Case:   Klimov's  Problem 

Klimov [22] first considered the problem of optimal control of a single-server multiclass
queue with Bernoulli feedback, with the criterion of minimizing the time average holding
cost per unit time. He proved that the optimal nonidling, nonpreemptive and nonanticipative
policy is a fixed priority policy, and presented an algorithm for computing the priorities
(starting with the lowest priority type and ending with the highest priority). Tsoucas [32]
modelled Klimov's problem as an optimization problem over an extended polymatroid using
as performance measures

$L_i^u$ = time average length of queue $i$ under policy $u$.

Algorithm $\mathcal{A}_1$ applied to this problem is exactly Klimov's original algorithm. A disadvantage
in this case is that priorities are computed from lowest priority to highest priority. Also,
Tsoucas does not obtain closed form formulae for the right hand sides of the extended
polymatroid, so it is not possible to evaluate the performance of an optimal policy. Our
approach gives explicit formulae for all of the parameters of the extended polymatroid and
also explains the somewhat surprising property that the optimal priority rule does not
depend in this case on the arrival rates. The key observation is that an optimal policy
under the time average holding cost criterion also minimizes the expected total holding cost
in each busy period (see Nain et al. [24] for further discussion). Now, we may model the
first busy period of Klimov's problem as a branching bandit process with the undiscounted
tax criterion, as considered in Section 3.

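Assuming that the system is stable, we apply the results of Subsection 3.2. We define
$\beta_i = E[v_i]$ and $t_i^S = E[T_i^S]$. By (65) we have

\[ t_i^S = \beta_i + \sum_{j \in S} (p_{ij} + \beta_i \lambda_j)\, t_j^S, \qquad i \in E, \tag{115} \]

which, applied to $S^c$ in place of $S$, becomes in vector notation

\[ t_{S^c}^{S^c} = \beta_{S^c} + (P_{S^c,S^c} + \beta_{S^c} \lambda_{S^c}^T)\, t_{S^c}^{S^c}, \]

i.e.,

\[ t_{S^c}^{S^c} = (I_{S^c} - P_{S^c,S^c} - \beta_{S^c} \lambda_{S^c}^T)^{-1} \beta_{S^c} \]

and

\[ t_S^{S^c} = \beta_S + (P_{S,S^c} + \beta_S \lambda_{S^c}^T)\, t_{S^c}^{S^c}. \]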


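After algebraic manipulations we obtain

\[ t_i^{S^c} = \bigl(\beta_i + P_{i,S^c} (I_{S^c} - P_{S^c,S^c})^{-1} \beta_{S^c}\bigr)\, \frac{\det(I_{S^c} - P_{S^c,S^c})}{\det(I_{S^c} - P_{S^c,S^c} - \beta_{S^c} \lambda_{S^c}^T)}, \qquad i \in S. \tag{116} \]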
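
The closed form can be checked numerically against a direct solution of the linear system for the mean busy periods. The sketch below assumes the reading in which the mean service times are written $\beta_i$, the direct route solves $(I - P_{S^c,S^c} - \beta_{S^c}\lambda_{S^c}^T)\, t = \beta_{S^c}$, and the closed form multiplies $\beta_i + P_{i,S^c}(I - P_{S^c,S^c})^{-1}\beta_{S^c}$ by the ratio of the two determinants. The helper names `solve` and `busy_period_means` are hypothetical, and plain Gaussian elimination is used only to keep the example dependency-free.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting; also return det(M)."""
    n = len(b)
    A = [list(map(float, row)) + [float(v)] for row, v in zip(M, b)]
    det = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            raise ValueError("singular matrix")
        if p != c:
            A[c], A[p] = A[p], A[c]
            det = -det  # a row swap flips the sign of the determinant
        det *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (A[r][n] - sum(A[r][k] * x[k] for k in range(r + 1, n))) / A[r][r]
    return x, det

def busy_period_means(S, beta, lam, P):
    """Return {i: (direct, closed)} for i in S, both estimating E[T_i^{S^c}]."""
    n = len(beta)
    Sc = [j for j in range(n) if j not in S]
    eye = lambda a, b: 1.0 if a == b else 0.0
    M_full = [[eye(a, b) - P[a][b] - beta[a] * lam[b] for b in Sc] for a in Sc]
    M_nolam = [[eye(a, b) - P[a][b] for b in Sc] for a in Sc]
    beta_c = [beta[a] for a in Sc]
    t_c, det_full = solve(M_full, beta_c)    # direct route on S^c
    w, det_nolam = solve(M_nolam, beta_c)    # w = (I - P_cc)^{-1} beta_c
    ratio = det_nolam / det_full             # determinant ratio in the closed form
    out = {}
    for i in S:
        direct = beta[i] + sum((P[i][j] + beta[i] * lam[j]) * t for j, t in zip(Sc, t_c))
        closed = (beta[i] + sum(P[i][j] * wj for j, wj in zip(Sc, w))) * ratio
        out[i] = (direct, closed)
    return out
```

The two routes agree because of the rank-one structure of $\beta_{S^c}\lambda_{S^c}^T$: the Sherman-Morrison formula and the matrix determinant lemma together produce the determinant ratio. Note that the bracketed factor in the closed form does not involve the arrival rates.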

Therefore, by the definition of $A_i^S$ in (73) we find that $A_i^S = t_i^{S^c}/\beta_i$, for $i \in S$, while $b(S)$ is
given by (74). Now, letting

\[ K_S = \frac{\det(I_{S^c} - P_{S^c,S^c})}{\det(I_{S^c} - P_{S^c,S^c} - \beta_{S^c} \lambda_{S^c}^T)}, \]

we may define $\bar A_i^S = A_i^S / K_S$ and $\bar b(S) = b(S)/K_S$, thus eliminating the dependence of
matrix $A$ on the arrival rates. As for the objective function, we have by (93)

yiO,C)  =  J2\^'~  ^''''''  ^'  I  ^-r  -  HE)  Y,  C,\  +  X:  C,h,.  (117) 

Hence the problem can be solved by applying algorithm $\mathcal{A}_1$ with input $(\bar R, \bar A)$, where

and since $(\bar R, \bar A)$ do not depend on the arrival rates, neither does the optimal policy. Note
that, as opposed to Klimov's algorithm, with this algorithm priorities are computed from
highest to lowest. This top-down algorithm was first proposed by Nain et al. [24], who
proved its optimality using interchange arguments. Bhattacharya et al. [3] provided a
direct optimality proof. Nain et al. proved that the resulting optimal index rule is also
optimal among idling policies for general service time distributions, and among preemptive
policies when the service time distributions are exponential. It is also easy to verify these
facts using our approach (in particular, the index of the idling state is 0, whereas all other
indices are nonnegative).

Moreover, in the case that the arriving jobs are divided into $K$ projects, where a type $k$
project consists of jobs with types in a finite set $E_k$, jobs in $E_k$ can only make transitions
within $E_k$, and $E$ is partitioned as $E = \bigcup_{k=1}^{K} E_k$, then it is easy to see that the Index
Decomposition Theorem 6 applies, and therefore we can decompose the problem into $K$
smaller subproblems.


4.3  Multiclass  Queueing  Systems 

Shanthikumar and Yao [26] showed that a large variety of multiclass queueing systems satisfy
strong conservation laws. The reader is referred to their paper for a list of particular systems
and performance measures that satisfy strong conservation laws. All their results correspond
to the special case that the performance space $B(A,b)$ is a polymatroid.

4.4  Job  Scheduling  Problems  without  Arrivals;  Deterministic  Scheduling 

There are $n$ jobs to be completed by a single server. Job $i$ has a service requirement
distributed as the random variable $v_i$, with moment generating function $\Psi_i(\cdot)$. It is immediate
to model this job scheduling process as a branching bandit process in which jobs have no
descendants. Let us consider first the discounted case. For $\alpha > 0$ it is clear by the definition
of $A_{i,\alpha}^S$ in (36) that $A_{i,\alpha}^S = 1$, for $i \in S$. Therefore the performance space of the vectors
$x^u(\alpha)$ studied in Section 3 is a polymatroid. Consider the discounted reward-tax problem
discussed in Section 3, in which an instantaneous reward $R_i$ is received at the completion
of job $i$, and a holding tax $C_i$ is incurred for each unit of time that job $i$ is in the system.
Rewards and taxes are discounted in time with discount factor $\alpha$. By (56) it follows that
the generalized Gittins index for job $i$, in the problem of maximizing rewards minus taxes,
is

\[ \gamma_i(\alpha) = \Bigl\{ R_i + \frac{C_i}{\alpha} \Bigr\} \frac{\alpha \Psi_i(\alpha)}{1 - \Psi_i(\alpha)}. \tag{118} \]
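
As a worked instance of the discounted index, the sketch below evaluates $(R_i + C_i/\alpha)\,\alpha\Psi_i(\alpha)/(1 - \Psi_i(\alpha))$ for a given value of the transform. The function names and the choice of an exponential service time with rate `mu` (for which $\Psi_i(\alpha) = \mu/(\mu + \alpha)$) are assumptions made for illustration only.

```python
def discounted_index(R, C, alpha, psi):
    """Index (R + C/alpha) * alpha * psi / (1 - psi), with psi = E[exp(-alpha * v)]."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not 0.0 < psi < 1.0:
        raise ValueError("psi must lie strictly between 0 and 1")
    return (R + C / alpha) * alpha * psi / (1.0 - psi)

def exponential_transform(mu, alpha):
    """E[exp(-alpha * v)] for v exponential with rate mu (illustrative choice)."""
    return mu / (mu + alpha)
```

For exponential service the index simplifies to $\mu (R_i + C_i/\alpha)$, since $\alpha\Psi_i(\alpha)/(1 - \Psi_i(\alpha)) = \mu$. As $\alpha \to 0$, the product $\alpha\,\gamma_i(\alpha)$ tends to $C_i\,\mu = C_i/E[v_i]$, the ratio that appears in the undiscounted tax problem.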


Let us now consider the undiscounted case without rewards. By the definition of
$A_i^S$ in (73) we have $A_i^S = 1$, for $i \in S$. Hence the performance space of the performance
vectors $x^u$ studied in Section 3 is also a polymatroid. Thus by equation (93) it follows that
the generalized Gittins index for job $i$ in the undiscounted tax problem is

\[ \gamma_i = \frac{C_i}{E[v_i]}, \tag{119} \]

thus providing a new polyhedral proof of the optimality of Smith's rule (see Smith [27]).
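
Smith's rule is easy to check by brute force on a small deterministic instance. The sketch below uses fixed processing times `v[i]` in place of $E[v_i]$ and compares the ordering by decreasing $C_i / v_i$ with every permutation; the function names are illustrative.

```python
from itertools import permutations

def weighted_completion_cost(order, C, v):
    """Total holding cost: sum over jobs of C[i] times the completion time of job i."""
    clock, cost = 0.0, 0.0
    for i in order:
        clock += v[i]
        cost += C[i] * clock
    return cost

def smith_order(C, v):
    """Jobs sorted by decreasing index C[i] / v[i], highest priority first."""
    return sorted(range(len(C)), key=lambda i: C[i] / v[i], reverse=True)
```

For example, with `C = [3, 1, 4, 2]` and `v = [2, 1, 5, 1.5]`, the cost of `smith_order(C, v)` can be compared with `min(weighted_completion_cost(p, C, v) for p in permutations(range(4)))`.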

In the case that there are precedence constraints among the jobs that form out-trees,
that is, each job can have at most one predecessor, it is easy to see that the problem can
also be modeled as a branching bandits problem and is thus solvable using the theory we have
developed in Section 3.

5      Reflections 

We presented a unified treatment of several classical problems in stochastic and dynamic
scheduling using polyhedral methods that leads, we believe, to a deeper understanding of
their structural and algorithmic properties. Perhaps the most important idea we used is
to ask the question: What is the performance space of a stochastic scheduling problem?
We believe that the approach of characterizing the feasible region of a stochastic scheduling
problem will lead to important new insights and methods and will bridge the artificial
gap between applied probability and mathematical optimization. Indeed, we hope that
our results will be of interest to applied probabilists, as they provide new interpretations,
proofs, algorithms, insights and connections to important problems in stochastic scheduling,
as well as to discrete optimizers, since they reveal a new fundamental structure (extended
polymatroids) which has a genuinely applied origin.